
IDENTIFYING SOCIAL MARKERS FROM NETWORK DATA BASED ON LOCATION, MOBILITY AND PROXIMITY

By

UDAYAN KUMAR

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2012

© 2012 Udayan Kumar


A Sanskrit saying which means “Good rapport and friendship develops among those who share a similar outlook on life and hobbies. Thus, deer flock with deer, cows with cows, and horses with horses. In the same manner, fools frequent fools and the wise bond with the wise.”


ACKNOWLEDGMENTS

From my initial thoughts of pursuing a PhD to actually finishing it, almost every step was confusing and the end of the tunnel was never visible. However, the constant support and encouragement I received along the way kept the hope alive; one by one, all the pieces of the puzzle fell into place. Reflecting back, I feel that there are more people to acknowledge than I can possibly remember: family members, teachers, friends, and even random strangers who were tolerant enough to listen to my crazy ideas and give me their point of view. However, if I look at a broader level, I would like to thank not only these people but also their parents, because obviously these people are here because of their parents. But their parents also had parents who themselves had parents, so I would like to thank everyone on this chain going backwards all the way up to the first living creature on Earth. I am also thankful to the creator of life on the Earth and the creator of the Earth. Obviously life would not have been possible without the creation of the Sun and the rest of the Universe, so I want to thank the creator of the Universe. But this makes me wonder: why would anybody undertake such a giant enterprise, that is, creating the whole universe, with billions and billions of galaxies, stars, planets and life forms? Maybe this is someone’s PhD project. So, am I a simulation object? In that case, I want to withdraw all my thanks; this was supposed to happen anyway!


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 USER CLASSIFICATION AND FEATURE EXTRACTION FROM WLAN TRACES
    2.1 Approach
        2.1.1 Location Based Classification (LBC)
            2.1.1.1 Individual Behavior based Filtering (IBF)
            2.1.1.2 Group Behavior based Filtering (GBF)
            2.1.1.3 Hybrid Filtering (HF)
        2.1.2 Name Based Classification (NBC)
    2.2 Validation of Location Based Classification
        2.2.1 Temporal Consistency Validation Using Adjacent Months
        2.2.2 IBF vs GBF
        2.2.3 Cross Validation
    2.3 User Behavior Analysis
        2.3.1 User Spatial Distribution
        2.3.2 Average Duration or Temporal Analysis
        2.3.3 Device Preference
    2.4 Applications
        2.4.1 Mobility Models
        2.4.2 Protocol Design
        2.4.3 Privacy
        2.4.4 Resource Management
    2.5 Conclusion And Future Work

3 BREAKING ANONYMITY IN WLAN TRACES
    3.1 Information In WLAN Traces
    3.2 Need For Anonymity
    3.3 Related Work
    3.4 Attack Scenarios
        3.4.1 Identify Your Own MAC In Trace
        3.4.2 Identifying Building Codes
        3.4.3 Identifying A Person
        3.4.4 Multiple Filtering
    3.5 Analysis and Mitigation
        3.5.1 Theoretical Analysis
        3.5.2 Practical/Trace Analysis
    3.6 Conclusions and Future Work

4 AN ENCOUNTER-BASED FRAMEWORK FOR TRUST
    4.1 Related Work
    4.2 Architectural overview
        4.2.1 Design Goals
        4.2.2 Overall Design
    4.3 Trust Adviser Filters
        4.3.1 Aggregation Based Similarity
            4.3.1.1 Frequency of Encounters (FE)
            4.3.1.2 Duration of Encounters (DE)
        4.3.2 Behavior Based Similarity
            4.3.2.1 Profile Vector (PV)
            4.3.2.2 Location Vector (LV)
            4.3.2.3 Behavior Matrix (BM)
        4.3.3 Hybrid Filter (HF)
    4.4 Anomaly Detection
        4.4.1 Detection Model
        4.4.2 Attacker Model
    4.5 Trace Based Evaluation and Analysis
        4.5.1 Traces
        4.5.2 Filter Evaluations
            4.5.2.1 Statistical Characterization
            4.5.2.2 Correlation
            4.5.2.3 Stability
            4.5.2.4 Graph Analysis
            4.5.2.5 Anomaly Detection
        4.5.3 Selfishness & Trust Routing in DTN
    4.6 Survey and Implementation Based Validation
        4.6.1 Survey
        4.6.2 iTrust Application
            4.6.2.1 Application Evaluation
            4.6.2.2 Energy Efficiency
            4.6.2.3 Location estimation
    4.7 Discussion: Other Trust Inputs
        4.7.1 Blacklist & Whitelist
        4.7.2 Recommendation & Reputation Systems
        4.7.3 Contextual & Event Information
        4.7.4 Combined Trust Recommendation
    4.8 Conclusion and Future Work

5 CONCLUSION AND FUTURE WORK

APPENDIX

A CODE SNIPPETS FROM iTrust APPLICATION
    A.1 Energy Efficient Scanning
    A.2 Calculating LV (Sec. 4.3.2)

B ENERGY EFFICIENT DEVICE DISCOVERY
    B.1 Available Directions
    B.2 Evaluations Techniques
    B.3 Current Progress
        B.3.1 Combining WiFi And Bluetooth Scanning
    B.4 Conclusion

C USER BEHAVIOR ANALYSIS
        C.0.1 Spatial Distribution
        C.0.2 Temporal Distribution

D SURVEY FORM - iTrust VALIDATION

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

Table

2-1 Average Silhouette Width for Sorority and Fraternities from University U1 and U2

2-2 Results of classification of users from U1 (LBC) and U2 (NBC). ‘Common’ signifies the users which were common to both male and female population.

2-3 Similarity in the user population selected after filtering fraternity users for U1

2-4 Similarity in the user population selected after filtering sorority users for U1

2-5 Validation - comparing users selected by IBF and GBF for U1

2-6 Cross validation of LBC by NBC for U2

3-1 WLAN trace sample: before and after anonymization

3-2 Fields present in each record of wired trace, basically an IP header

3-3 Result of finding users with similar location visiting sequences with varying duration of the trace

4-1 Overhead of filters in terms of processing and storage. Here m is the total no. of records in the encounter file, n is the no. of unique encountered users, l is the no. of locations visited, and d represents the no. of days used for BM calculations. We also assume that m >> n.

4-2 Facts about studied traces

4-3 False positives and negatives while using the proposed anomaly detection (in percentage)

B-1 Accuracy loss using traces for 20 users. EE4 means 4 times the minimum scan period is the upper bound of the scan interval; similarly, in EE8 the upper bound on the skip period is 8. This result used Bluetooth traces only. Lower values are better.

B-2 Scan efficiency using traces for 20 users. EE4 means 4 times the minimum scan period is the upper bound of the scan interval; similarly, in EE8 and EE16 it is 8 and 16 times respectively. This result used Bluetooth traces only. Higher values are better.

B-3 s/e ratio for Star, MIMD and FIBO algorithms

B-4 Combining Wi-Fi and Bluetooth scanning

C-1 Spatial Distribution of Users at U2

C-2 Spatial Distribution of Users at U1

C-3 Average Duration of Users at U2

C-4 Average Duration of Users at U1


LIST OF FIGURES

Figure

2-1 Query based user grouping technique

2-2 A sample trace database snapshot

2-3 Gender grouping in Fraternities and Sororities

2-4 Session count for fraternity and sorority users

2-5 Session count for fraternity and sorority users

2-6 Session count for fraternity and sorority users

2-7 Session count for fraternity and sorority users

2-8 Session count for fraternity and sorority users

2-9 Session count for fraternity and sorority users

2-10 Comparison of user distribution across the university U1 campus (in percentage)

2-11 Comparison of user distribution across the university U2 campus (in percentage)

2-12 Average duration of males and females in different areas of the university U1 campus

2-13 Average duration of males and females in different areas of the university U2 campus

2-14 Device distribution by manufacturer at university U1

2-15 Device distribution by manufacturer at university U2

3-1 Attacker capabilities

3-2 Percentage of no. of users found when 111 filters based on gender+major+manufacturer are applied

3-3 UL at n = 5

3-4 Results of the combination generation and sequence matching for randomly chosen 230 users out of 27K users belonging to the month of Nov 2007. This graph shows P_i and n_i.

4-1 Block diagram overview of the iTrust architecture. Dotted lines indicate modules needed by iTrust. Shaded blocks indicate modules discussed in this work.

4-2 Location Vector LV for a user

4-3 Behavior Matrix for a user

4-4 The growth of trust score using FE filter for a specific user. Each line corresponds to an encountered user.

4-5 The growth of trust score using FE filter using the attacker model. Each line corresponds to an instance of attacker generated by the model.

4-6 Similarity score for various filters for all the encountered pairs of users in Nov 2007 from U1 trace

4-7 Correlation between the trusted lists produced by various filters at T=40%

4-8 Comparison of trust list belonging to different history for various filters at T=40% (note that the y-axis scale for DE, FE, and LV-C starts at 85% and for LV-D and BM the scale starts at 35%)

4-9 Normalized Clustering Coefficient and Normalized Path Length

4-10 Flow chart for iTrust routing

4-11 Average unreachability with varying Trust and Selfishness using DE filter

4-12 Hybrid filter results when T=40%. The number in the legend indicates the ratio of the score from each filter. For example, 1211 implies α_DE = 0.2, α_FE = 0.4, α_LV-D = 0.2, and α_BM = 0.2, and 0100 implies α_DE = 0, α_FE = 1, α_LV-D = 0, and α_BM = 0 (Sec. 4.3.3).

4-13 Survey results showing user’s propensity to communicate with other users in various communication scenarios

4-14 Illustration of iTrust’s components and their interactions

4-15 Screenshots of the iTrust application. Fig. A shows the main screen, where encountered users are sorted by the filter score. Current encounters are marked with green circles. Trusted users are shown in blue. Fig. B shows details for an encountered user. Fig. C shows user encounters on a map. Fig. D shows the registration screen for the optional user information discovery service. Fig. E shows the screen where the display order of encountered users can be modified. Fig. F shows the screen to select weights for the Hybrid filter (in the app it is referred to as the combined filter). Fig. G shows the screen where the user can check self statistics regarding encounters. It also shows the number of scans saved due to the use of the energy efficient scanner. Fig. H shows the menu, which allows the user to jump from one screen to another.

4-16 Continuation of screenshots of the iTrust application. Fig. A shows the settings screen. Fig. B shows the number of encounters the user had with a particular user over a period of time. This feature allows a user to know more about encountering users. Fig. C shows a graph from the Self-Stat screen of the application. Here the graph shows the total number of encounters this user had with respect to time. Fig. D shows the about page with author information and a web link for iTrust.

4-17 iTrust evaluations based on application usage. Fig. A shows the percentage of trusted users in the Top 1 to 10 users, Top 11 to 20 users, etc. for each filter. Fig. B shows the percentage of total trusted users in Top 1 to 10, 11 to 20, etc. Fig. C shows the fraction of encountered users needed (from the top) to capture ‘x’% of trusted users for each filter. Fig. D shows the Normalized Discounted Cumulative Gain score for iTrust recommendations.

A-1 Evolution of features in the iTrust app based on feedback from users.


Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

IDENTIFYING SOCIAL MARKERS FROM NETWORK DATA BASED ON LOCATION, MOBILITY AND PROXIMITY

By

Udayan Kumar

December 2012

Chair: Ahmed Helmy
Major: Computer Engineering

The ubiquitous spread of mobile devices, global connectivity, and the tight coupling of mobile phones with their users have led to an era in which mobile phones have become the alter ego of their users. Mobile devices accompany users to places where not even the closest family and friends are allowed (e.g., offices, meetings, and conferences). Access to the movement and network access logs from mobile devices can shed light on human behavior, which in turn can be used to solve several research challenges.

In this work, we present our measurements, analysis and designs obtained by utilizing network traces collected at both personal and group level. We have used network traces from several thousand devices to understand, identify and extract social markers or characteristics. The social markers we have studied include social grouping based on gender, proximity-based trust and the difficulty of anonymizing traces because of mobility.

In the first part, we discuss how social-grouping information can be extracted from anonymized network traces. Using a gender-based case study, we demonstrate our approach, along with different methods to validate the results. In the second part, we study the fundamental trade-off between the utility of WLAN traces and the privacy of the users. We show how the privacy of users in anonymized traces can be compromised. In the third and final part, we implement and evaluate an effective framework to establish trust in mobile networks through a protocol that we call iTrust. We present results of our trace-based analysis and of a user study based on the deployment of the iTrust mobile application.


CHAPTER 1
INTRODUCTION

The ubiquitous spread of mobile devices, global connectivity, and the tight coupling of mobile phones with their users have led to an era in which mobile phones have become the alter ego of their users. Mobile devices accompany users to places where not even the closest family and friends are allowed (e.g., offices, meetings, and conferences). This tight coupling can be used not only to provide connectivity to the Internet but also to provide personalized services based on the behavior patterns of the user. One example is a game application that customizes itself based on the free time a user has. Let's say that a user always commutes to work using public transport and on the way uses the mobile device to play games. A device that can detect this context can pass the approximate duration of the commute to the game application, allowing it to generate a game that can be finished during the commute. Here the phone reads sensors such as GPS and the accelerometer to infer user context. The general idea is that we can use or design sensors that can sense everything experienced by a user. Once we have these sensor readings, everything presented to a user can be customized.

The applications in the above example are based on behavior sensing from a single device. What if we have access to sensor information from all devices? Can we predict traffic congestion even before it happens by considering the total number of people heading towards the freeway? Can we study the movement patterns of the population to predict the spread of infectious diseases? Can we guess the type of relationship existing between a pair of users? Several crowdsourcing applications have been developed to collect data from a large pool of users (if not all) to get a global view. The challenges that remain even with access to this kind of data include handling the data (its scale can be huge; imagine a data tuple generated for every person every minute), developing algorithms that can generate meaningful inferences, and the implications for the privacy of the users. In many cases, obtaining inferences is challenging due to the lack of standard models and theorems that relate sensor readings to human behavior characteristics. It is also difficult to validate and verify inference propositions, as accessing the ground truth is a challenge in itself because of the scale of data collection (from several hundred to billions of users).

In this work, we present our measurements, analysis and designs obtained by utilizing network traces collected at both personal and group level. We have restricted ourselves to analysis and design based on location, mobility and proximity features. We have attempted to relate these features to social characteristics or markers, including social grouping based on gender, homophily-based trust and the implications of mobility for the anonymization of traces. Due to the lack of suitable verification methods, we have developed our own. We expect that several applications can benefit not only from our analysis techniques and results, but also from the verification methods developed in this work.

Access to social markers and context can allow researchers to understand and model characteristics of human behavior, create new services, and make applications context aware, among other possibilities. For example, it has been shown that the random mobility model does not capture actual human mobility [16] and that information on social and community structures can create better mobility models [37]. Recently, researchers have shown how context can be sensed and used to provide new applications [51]. In the last part of this work, we present how the social science principle of homophily can be measured using mobile devices and how it can be used to generate trust in the network. The understanding of social data and user behavior has led to the development of a new field of study called Computational Sociology [52] [71].

In this work, we present methods to extract social markers such as gender-based grouping and proximity-based trust. The challenges we faced in accomplishing this task include the unavailability of any kind of personal information about the users (mainly due to anonymization for privacy protection). Thus, we have not only developed methods to extract this kind of information from anonymized traces (data sets), but also developed methods to validate our results in the absence of ground truth. This work is divided into three main topics: (1) user classification into groups, with a case study on gender-based grouping; (2) challenges of anonymizing traces due to mobility; and (3) proximity-based trust.

In the study on user classification, we present two novel techniques to classify WLAN users into social groups. The first technique maps the traces to buildings (e.g., department buildings, libraries, sororities and fraternities) to extract affiliation and gender information based on network usage statistics. The second technique utilizes directory (phone-book) information that can be linked to WLAN users to extract useful information. For example, usernames of the WLAN users (if available) can be used to find a user’s gender based on the first name and databases. As a case study, we perform classification and behavior analysis of users by gender. Extensive WLAN traces from two major universities were collected over three years and analyzed. Results from the two methods are cross-validated and show more than 90% correspondence. Comparing usage patterns provided interesting results, including that males spend more time online than females and that females prefer Apple computers over PCs.
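The cross-validation idea can be illustrated with a minimal sketch, assuming that one method yields a gender label per anonymized user from building visits and the other yields a label from the first name in a username. The name table, the `first.last` username format, and the function names below are hypothetical and are not taken from the traces or tools used in this work.

```python
from typing import Dict, Optional

# Hypothetical first-name table; a real study would rely on a large name database.
NAME_GENDER: Dict[str, str] = {"alice": "F", "maria": "F", "john": "M", "david": "M"}


def name_based_label(username: str) -> Optional[str]:
    """Guess a gender label from the first token of a 'first.last' style username."""
    first = username.split(".")[0].lower()
    return NAME_GENDER.get(first)  # None when the first name is not in the table


def agreement(location_labels: Dict[str, str], usernames: Dict[str, str]) -> float:
    """Fraction of users labeled by both methods on which the two labels match."""
    matches = 0
    compared = 0
    for user_id, location_label in location_labels.items():
        username = usernames.get(user_id)
        if username is None:
            continue  # no directory entry linked to this user
        name_label = name_based_label(username)
        if name_label is None:
            continue  # name-based method cannot decide; skip the user
        compared += 1
        matches += int(name_label == location_label)
    return matches / compared if compared else 0.0
```

Users that only one method can label are excluded from the comparison, so the resulting fraction measures correspondence between the methods rather than the coverage of either one.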

In the second part, we study the fundamental trade-off between the utility of WLAN traces and the privacy of the users. The study provides several realistic case studies in which privacy attacks may be conducted. We then analyze these attacks and the drawbacks of existing anonymization techniques. Our initial quantitative analysis to estimate mobile users’ k-anonymity in WLAN traces shows surprisingly unique usage patterns, which may compromise anonymity. The main contribution of this part is to articulate the compelling challenges facing the anonymization of wireless network traces and to shed light on an intriguing question: just how private are wireless network traces?
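To make the notion of k-anonymity concrete, the following is a minimal sketch, assuming each anonymized user is represented by an ordered sequence of visited locations. A user's k is taken to be the number of users sharing exactly the same sequence, so k = 1 means the pattern is unique. The data layout and function names are illustrative assumptions, not the analysis code behind the results reported later.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def anonymity_set_sizes(visits: Dict[str, Sequence[str]]) -> Dict[str, int]:
    """Map each user to k, the number of users with an identical visit sequence."""
    signatures: Dict[str, Tuple[str, ...]] = {
        user: tuple(sequence) for user, sequence in visits.items()
    }
    counts = Counter(signatures.values())
    return {user: counts[signature] for user, signature in signatures.items()}


def fraction_unique(visits: Dict[str, Sequence[str]]) -> float:
    """Fraction of users whose visit sequence is shared with nobody else (k = 1)."""
    sizes = anonymity_set_sizes(visits)
    if not sizes:
        return 0.0
    return sum(1 for k in sizes.values() if k == 1) / len(sizes)
```

Under this simplified definition, a large fraction of users with k = 1 would suggest that replacing identifiers alone offers little protection, because the visit sequence itself acts as an identifier.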


For the third part, we implement and evaluate an effective framework to establish trust in mobile networks through a protocol that we call iTrust. The goal of iTrust is to provide accurate and robust trust scores to encountered devices in an efficient, privacy-preserving and resilient manner. We borrow from the social science principle of homophily: a tendency of individuals to interact with and trust similar others. We introduce and analyze a family of encounter-based trust adviser filters that make trust recommendations based on encounter frequency, duration, location behavior-vector, and location preference behavior-matrix. We present a proof-of-concept application for Android and Linux-based mobile devices. We also conduct a user study to validate the trust recommendations generated by iTrust. With this trust, several potential applications can be enabled, including mobile social networking, building groups and communities of interest, localized alert and emergency notification, and context-aware and similarity-based networking.

The contributions of this work can be categorized into two components, 1.

Intellectual contributions and 2. Effort contributions.

Intellectual contributions include:

1. Applied existing data mining techniques to classify network users into social groups. Created methods to statistically validate the results in the absence of ground truth.

2. Identified techniques to break anonymization of the traces by capitalizing on the mobility of a user.

3. Introduced methods to infer/recommend trust/friendship among encountering users, using several outlooks. We proposed several privacy-preserving filters or metrics that can be used to measure similarity.

Effort contributions include:

1. iTrust and Profile-Cast implementation on mobile devices.

2. Collection of UF wireless network traces.

3. Creation of Bluetooth trace library.


4. Developed several basic building blocks, such as scanners and parsers, for the Android, Nokia and Openmoko platforms.

In the following sections we present each piece of work in detail, starting with user

classification into groups, then anonymization and finally discussing proximity based

trust.


CHAPTER 2

USER CLASSIFICATION AND FEATURE EXTRACTION FROM WLAN TRACES

In future mobile networks, with many hand held devices tightly coupled with a user,

communication performance is bound to user mobility and behavior. This applies to

various kinds of mobile networks, including cellular networks, but more particularly

ad-hoc and delay tolerant networks (DTNs), because every node may act as a router

and the network may be infrastructure-less. In such an environment, it is imperative

to understand the various aspects of user behavior, including mobility, commonalities,

differences in preference, and net activity between classes of users, in order to design

efficient protocols and effective network services.

We propose a new approach to classification and feature analysis of user behavior

based on social grouping, using a set of techniques which can be used to provide

information about a user from a social perspective. The best source of information about

real user mobility and network usage is Wireless LAN (WLAN) traces. These

traces have been used in many studies whenever real user data is required. They

have been previously used to validate mobility models [37, 65] and understand user

associations [36] among other usages. We, in this work, propose to use WLAN traces

(generally considered for studying network characteristics) to mine social behavior of

the users based on gender, majors, and other interest groups. We present a general

methodology with an example case study of grouping by gender, and investigate gender

gaps in WLAN usage. The lack of such empirical data poses an interesting challenge

and raises several research (and privacy) questions: How can we meaningfully infer

gender information from such anonymous traces? Does gender information influence

user behavior and preference in a significant and consistent manner? Finally, what is the

impact of these findings on network modeling and on protocol and service design in the future?

Gender based studies have been conducted in the past to study issues such as

differences in technology adoption for the wired Internet [30]. This paper is the first,


to our knowledge, to scientifically analyze WLAN usage patterns in mobile societies

across user groups. Our study begins by introducing a location-based method for

gender classification on campus. It provides robust filters, based on individual and

group network behavior, in addition to clustering techniques, to identify males and

females with high confidence. We analyze extensive Wireless LAN traces collected for

over 3 years from 2 major universities covering more than 50,000 users. The findings

are cross-validated against the name-based method, our closest approximation to ground truth, and yield over 90%

success. Once the gender classification is performed, a thorough investigation of the

spatio-temporal characteristics of the gender based network activity is conducted.

Among the parameters we have considered for evaluating the gender gaps, we found

enough statistical evidence to conclude that (for the traces used in our study) usage

patterns of males and females are different, and that gender does affect user activity

and vendor preference. We believe that such attributes will certainly enhance the

understanding of the mobile society and are essential to providing efficient network

protocols and services in the future. Our findings also indicate that the problem of

mobile user privacy should be re-visited.

Contributions: This paper makes the following contributions: i. class and gender

inference methods based on location, usage and name filtering from extensive WLAN

traces, ii. the first gender-based trace-driven analysis in mobile societies,

including a study of majors and device preferences, iii. identification of unique features in the

studied groupings that suggest consistent behavior and the design of potential future

applications.

The rest of the paper is outlined as follows: Sec. 2.1 discusses multiple techniques

for user classification, followed by Sec. 2.2, which provides several methods for

validating the classification. Sec. 2.3 provides the gender-based feature analysis

and results, and Sec. 2.4 discusses potential applications. Conclusions and future

work are presented in Sec. 2.5.


Figure 2-1. Query based user grouping technique

Figure 2-2. A sample trace database snapshot

2.1 Approach

In this work, we consider WLAN traces to understand usage characteristics/behavior

pattern of social groups. WLAN traces are logs of user association with a Wireless

Access Point (AP). Traces generally contain the machine’s MAC address, association time,

duration, and associated AP. The MAC address is always anonymized to protect the privacy of

the user. How can we begin to classify all the students into social groups like gender

and study major using only the publicly available information and traces [7][41]? Having

a meaningful classification with this partial information is the main challenge that we

address in this work. Ideally, we would want to classify all users into groups. Taking a

first step in this direction we present a general technique, which can be used to classify

a smaller section of WLAN users into groups. Doing it for all the users still remains a

challenge as we shall see. Instead, we focus on obtaining a sample significant enough

for a statistical analysis.


Our technique works on raw WLAN SNMP and SYSLOG traces. The traces are

accumulated for a time period and parsed into a standard format as shown in Fig. 2-1.

We use the location information of the APs, in the form of buildings in which they are

located. This helps to identify the geographic locations of a user at a later stage. Mobility

of users can be tracked by looking at the approximate geographic locations of the APs.

The processed data is fed into a database on which SQL queries can be run easily

(and generically) to extract information of interest. Fig. 2-2 illustrates the generic trace

database layout, which is used in our experiment. The fields include the following: 1.

anonymized MAC addresses of the wireless devices logged onto the WLAN, 2. the

session start time (in seconds), 3. the AP with which the wireless device associated, 4.

the duration of the association with the AP, 5. the manufacturer of the wireless card (which

we inferred from the partial MAC address), and 6. the building in which the AP is located

(inferred from a map; this field is external to the actual traces). Two-dimensional

coordinates can be built into the database based on a campus grid map to allow

mobility-based queries to be performed as well. In some cases, if more information such

as usernames is available, we can add more fields to the database. The advantage

of having a standard schema for the database is that similar queries can be used on

traces coming from multiple sources. We have used this same database framework to

analyze traces from USC[41], Dartmouth [7], UF and UNC[3]; the method is general and

applicable to many traces (campuses and urban) and several grouping criteria.
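
As an illustration of the layout described above, the following sketch builds a small in-memory table with Python's standard sqlite3 module and runs one grouping query. The table name, column names, and sample rows are assumptions made for this example only; they are not the schema or data of the actual trace database.

```python
import sqlite3

# Hypothetical schema mirroring the six fields listed above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sessions (
        mac        TEXT,     -- anonymized MAC address
        start_time INTEGER,  -- session start time, in seconds
        ap         TEXT,     -- access point the device associated with
        duration   INTEGER,  -- association duration, in seconds
        vendor     TEXT,     -- manufacturer inferred from the MAC prefix
        building   TEXT      -- building housing the AP (external to the traces)
    )
    """
)

# Made-up rows, for illustration only.
rows = [
    ("aa01", 1000, "ap1", 3600, "Apple", "Sorority A"),
    ("aa01", 9000, "ap1", 1800, "Apple", "Sorority A"),
    ("bb02", 2000, "ap7", 600, "Intel", "Fraternity B"),
    ("aa01", 5000, "ap7", 300, "Apple", "Fraternity B"),
]
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?, ?, ?)", rows)


def usage_by_building(connection):
    """Return (mac, building, session_count, total_duration) per user and building."""
    query = """
        SELECT mac, building, COUNT(*), SUM(duration)
        FROM sessions
        GROUP BY mac, building
        ORDER BY mac, building
    """
    return connection.execute(query).fetchall()


usage = usage_by_building(conn)
```

Because the query refers only to the standard columns, the same statement could in principle be reused on traces from any source that has been parsed into this layout.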

The trace collection process, environment, and anonymization used have a great impact

on the utility of the traces, and traces coming from different sources may have

undergone entirely different processing and carry different information. It is therefore very difficult to find one general method

that would classify users in all settings, so we propose multiple methods. As it is

very difficult to get hold of this data, it is even more difficult to validate it. We have used

several statistical methods to give us confidence in the classification and cross-validated it

against the name-based approach, the closest available approximation to the ground truth at a large scale.


Figure 2-3. Gender grouping in Fraternities and Sororities

We use traces from two universities, U1 and U2 (names withheld for privacy

reasons) that provide information as shown in Fig. 2-2 except that university U2 trace

also provides the usernames. Traces from U1 cover Feb 2006, Oct 2006 and Feb

2007, and traces from U2 cover Nov 2007 and Apr 2008. The grouping criterion

we investigate in this work is gender. To do this categorization, we

propose two novel techniques: Location based Classification (LBC) and Name based

Classification (NBC), and subsequently, we examine and discuss their advantages.

2.1.1 Location Based Classification (LBC)

Most US universities have sororities (female organizations) and fraternities (male

organizations) as social organizations. The buildings that house these organizations

also serve as residences for most of the members. Given the physical location of APs

on campus, APs located in sororities and fraternities are identified, and the users

associated with them are classified as females or males respectively. Fig. 2-3 illustrates

how grouping is done in this setting. This method can also be used to classify users

by other grouping criteria such as study major. For example, all users associating with

the Computer Science building APs can be classified as Computer Science major students.

Since wireless networks may be used by anyone in the physical proximity to the AP, this

kind of classification will also have un-related users or visitors accessing these APs,


which can make the classification inaccurate. We next present techniques to separate

regular users from visitors at an AP.

Filtering: LBC requires filtering, as fraternities and sororities have male and

female visitors. Without further refinements and filtering, this method would not be

accurate. But even if we validate the presence of visitors, how can we filter them from

our classification? First, visitors are infrequent users of the mobile network in the visited

locations. Second, we expect a significant difference between residents and visitors

in terms of network activity (in number and duration of on-line sessions). Third, a user

who is visitor at one location can be a regular user at some other location. Hence,

we can define a visitor as a user with fewer sessions and shorter session

duration than the average user in that location (group behavior) or as a user who

has more sessions and longer online duration at other locations (individual behavior).

Our filtering techniques rate users based on two metrics: the number of sessions and

session duration. Once we rate all the users on these two metrics, we apply cut-off

thresholds to determine regular users. Filtering can be performed on these ratings

considering individual and/or group behavior as described in the rest of this section.

2.1.1.1 Individual Behavior based Filtering (IBF)

In Individual Behavior based Filtering (IBF), we find the probability of a user being

male or female by counting the number of sessions and measuring the duration he/she

spends in fraternities versus sororities. This can be done using the equations below.

The probability of a user being male, considering only session counts at fraternities

and sororities is given by:

$$P_{CM}(u) = \frac{C_f(u)}{C_f(u) + C_s(u)}$$

where function Cf gives session count for user u in fraternities and function Cs gives the

session count for user u in sororities.

Similarly, the probability of a user being male, considering only session durations at

fraternities and sororities is given by:


$$P_{DM}(u) = \frac{D_f(u)}{D_f(u) + D_s(u)}$$

where function Df gives the total duration of sessions for user u in fraternities and

function Ds gives the total duration of sessions for user u in sororities.
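
The two ratios above translate directly into code. The sketch below assumes each user's sessions are available as two lists of durations (in seconds), one for fraternity APs and one for sorority APs; the function name and the sample values are illustrative and not taken from the traces.

```python
def male_probabilities(frat_sessions, sor_sessions):
    """Return (P_CM, P_DM) for one user.

    Each argument is a list of session durations in seconds. The user is
    assumed to have at least one session of non-zero duration at a
    fraternity or sorority AP, so both denominators are non-zero.
    """
    c_f, c_s = len(frat_sessions), len(sor_sessions)   # session counts
    d_f, d_s = sum(frat_sessions), sum(sor_sessions)   # total durations
    p_cm = c_f / (c_f + c_s)
    p_dm = d_f / (d_f + d_s)
    return p_cm, p_dm


# Made-up user: three fraternity sessions and one short sorority session.
p_cm, p_dm = male_probabilities([3600, 1800, 600], [400])
```

For this made-up user the count-based ratio is 3/4 and the duration-based ratio is 6000/6400, so the two measures agree in direction but not in magnitude.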

Fig. 2-4 shows users who visited fraternities and/or sororities in decreasing order

of PCM(u) and PDM(u) for traces from university U1. An interesting observation is that

both PCM and PDM follow a similar trend and there is a sudden drop (transition) from

1 to 0 (between the 500th and 700th user), essentially separating males from females. In

Fig. 2-4A, out of 1119 users, there is a large number (∼ 425) of users whose probability

of being male is 1. These users have never associated with sorority APs. We also

have a large number (∼ 362) of users who have never associated with fraternity APs

(PCM = 0 and PDM = 0), whom we can classify as females. As fraternities and sororities

have visitors, many males will have a probability less than 1 (and many females a probability greater than 0). If

we only considered users with probability 1 or 0, we would remove a considerable number of legitimate

users who have visited and used the WLAN at other locations (sororities for males and

fraternities for females).

We have instead classified all the users having PCM > 0.80 and PDM > 0.80 as

males and PCM < 0.20 and PDM < 0.20 as females, using the 80-20 rule or the Pareto

principle, such that 80% of the regular users should fall in the top 20% probability range. Other

users are discarded from our study. The results from University U2 are similar

(Fig. 2-5). This method, IBF, is generic and can also be used with other grouping criteria such

as study major.
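
The thresholding step can be sketched as a small function. The name classify_ibf and the use of None for discarded users are illustrative choices; only the 0.80 and 0.20 cut-offs come from the text.

```python
def classify_ibf(p_cm, p_dm, high=0.80, low=0.20):
    """Label a user from the two male probabilities.

    Returns "male" when both values exceed `high`, "female" when both fall
    below `low`, and None when the user is discarded from the study.
    """
    if p_cm > high and p_dm > high:
        return "male"
    if p_cm < low and p_dm < low:
        return "female"
    return None


labels = [classify_ibf(1.0, 1.0), classify_ibf(0.0, 0.1), classify_ibf(0.75, 0.9375)]
```

Requiring both metrics to pass the cut-off is the conservative reading of the rule: a user whose count-based and duration-based probabilities disagree is left unclassified.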

2.1.1.2 Group Behavior based Filtering (GBF)

In Group Behavior based Filtering (GBF), we filter a user based on where his/her usage pattern

lies with respect to all the users at a particular location. GBF is also useful when traces

are available only from a limited number of buildings and we cannot use IBF due to the lack

of traces from all the buildings. For example, suppose that at a particular location

we discover that the average session duration of regular users is 3000 sec and their session

Figure 2-4. Users visiting fraternities and/or sororities in decreasing order of their male probability at University U1. A) Feb 2006. B) Oct 2006. C) Feb 2007. [Plots: x-axis, users in decreasing order of their male probability; y-axis, probability of being male; series PCM and PDM.]

count is 10 in a period of one month. All users who at least meet these criteria are then

considered regular users and are classified as male or female based on the location;

everyone else is considered a visitor and is therefore removed. Finding these thresholds is

not a trivial task, as they would vary from building to building and may also

change with time. For this task we employ clustering techniques [18] (one of the key

methods for unsupervised learning) to partition our data into regular users and visitors.
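
For illustration, the fixed-threshold version of this filter (before the thresholds are learned by clustering) might look as follows. The 3000 sec and 10-session defaults are the example values from the text; the data layout and the function name are assumptions.

```python
def regular_users(usage, min_avg_duration=3000, min_sessions=10):
    """Return the users who meet both example criteria at one location.

    `usage` maps a user id to the list of that user's session durations
    (in seconds) at the location over one month.
    """
    regulars = set()
    for user, durations in usage.items():
        if not durations:
            continue  # no sessions at this location, so not a regular user
        average = sum(durations) / len(durations)
        if len(durations) >= min_sessions and average >= min_avg_duration:
            regulars.add(user)
    return regulars


# Made-up usage: u1 is a resident, u2 has few sessions, u3 has short sessions.
sample = {"u1": [3200] * 12, "u2": [5000] * 3, "u3": [100] * 40}
kept = regular_users(sample)
```

Hand-picked thresholds like these would have to be retuned for every building and trace period, which is the motivation for the clustering approach.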

Clustering: Clustering can be used to divide a set of users into several subsets

such that users in each subset are most similar based on WLAN usage metrics

(duration, session count, distinct login days). Of the two general categories of clustering

Figure 2-5. Users visiting fraternities and/or sororities in decreasing order of their male probability at University U2. A) Nov 2007. B) Apr 2008. [Plots: x-axis, users in decreasing order of their male probability; y-axis, probability of being male; series PCM and PDM.]

algorithms, namely hierarchical and partitioning schemes, we choose a robust partitioning

method called Partitioning Around Medoids (PAM) [45]. This method has distinct

advantages (over standard k-means [18]) in that it uses a dissimilarity score to minimize

dissimilarity within each cluster, making clusters robust to outliers. It also provides

a novel method called Silhouette Widths and Plots for estimating cluster quality. The

average Silhouette Widths are useful in estimating the number of clusters present in

the data (often a challenging job in cluster analysis). One has to run PAM several times,


each time with a different number of clusters, and then compare the resulting average Silhouette

Widths. The number of clusters that produces the maximum average width is taken as the best

clustering. The average width can also be used to estimate the quality of the clustering:

above 0.70 for a strong structure, between 0.50 and 0.70 for a reasonable structure, and

below 0.50 for a weak structure [45].
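
The selection procedure can be sketched as follows. This is a simplified swap-based k-medoids in the spirit of PAM, written for illustration, and not the full algorithm of [45]; the feature values are made up, Euclidean distance is assumed as the dissimilarity, and all points are assumed to be distinct.

```python
import math


def cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k):
    """Greedy swap search: replace a medoid whenever that lowers the cost."""
    medoids = list(range(k))  # naive start: the first k points
    best = cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                trial_cost = cost(points, trial)
                if trial_cost < best:
                    medoids, best, improved = trial, trial_cost, True
    labels = [min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels


def average_silhouette(points, labels):
    """Mean silhouette width; a point alone in its cluster contributes 0."""
    widths = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Made-up (distinct login days, session count) pairs: four residents, four visitors.
users = [(25, 60), (27, 70), (24, 55), (26, 65), (1, 2), (2, 3), (1, 1), (3, 4)]
widths = {}
for k in (2, 3, 4):
    _, labels_k = k_medoids(users, k)
    widths[k] = average_silhouette(users, labels_k)
best_k = max(widths, key=widths.get)  # number of clusters with the widest silhouette
_, labels = k_medoids(users, best_k)
```

On real traces the features would normally be scaled before clustering, since duration in seconds would otherwise dominate the distance; that step is omitted here.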

We use PAM to distinguish visitors from regular users (i.e., residents). We use the

number of distinct days of login, the session count, and the sum of session durations as the

metrics for user evaluation. These metrics help identify and thus separate users who

have several sessions on only a few days (possibly visitors) from users who have sessions

every day. We applied this clustering technique to the sorority and fraternity user traces

from both Universities U1 and U2. We found that the best number of clusters in each case

is 2. In each set the average silhouette width is at least 0.65, with 0.84 being the

maximum (full results in Tab. 2-1). The two clusters match

our intuition of regular users and visitors and separate them using usage

behavior in that particular building/location. Also, the high average silhouette width

indicates the high quality of the clustering. Detailed results of GBF are in the middle columns of

Tab. 2-2.
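
One possible way to derive the three metrics from raw session records is sketched below. The record layout (user, start time, duration) and the choice to count days from the session start time alone are assumptions made for this example.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def usage_features(sessions):
    """Compute per-user clustering features at one location.

    `sessions` is an iterable of (user, start_time_sec, duration_sec).
    Returns {user: (distinct_login_days, session_count, total_duration)}.
    """
    days = defaultdict(set)
    count = defaultdict(int)
    total = defaultdict(int)
    for user, start, duration in sessions:
        days[user].add(start // SECONDS_PER_DAY)  # day index of the login
        count[user] += 1
        total[user] += duration
    return {user: (len(days[user]), count[user], total[user]) for user in count}


# Made-up records: user "a" logs in on two days, user "b" on one day only.
records = [
    ("a", 0, 100), ("a", 90000, 200), ("a", 90500, 50),
    ("b", 10, 10), ("b", 20, 10), ("b", 30, 10),
]
features = usage_features(records)
```

Both made-up users have three sessions, so only the distinct-days and duration features tell them apart, which mirrors the point made above about users who log many sessions on few days.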

Fig. 2-6 shows the effect of total session duration, total number of sessions and

unique days of login on the clustering of users. We can see a clear drop in the number of

sessions and unique days when moving from cluster 2 to cluster 1 (cluster 2 contains

the residents). We notice that at the beginning of cluster 1 there is a spike in the total

duration, but these users are still not included among the regular users because their number of

sessions and unique days of login are comparatively lower than those of users belonging to cluster

2. Clustering ensures that all three metrics are incorporated when making a decision.

Similar results are obtained for other traces from university U1 (Fig. 2-7) and U2 (Fig.

2-8 and Fig. 2-9). GBF is generic and can be used to identify other social groupings

such as study-major, which will be investigated in our future research.


Table 2-1. Average Silhouette Width for Sororities and Fraternities from Universities U1 and U2

                      U1                            U2
             Feb 2006  Oct 2006  Feb 2007    Nov 2007  Apr 2008
Fraternity     0.72      0.74      0.75        0.84      0.78
Sorority       0.65      0.72      0.69        0.78      0.76

Figure 2-6. Clustering results for University U1 Sororities. A) Feb 2006. B) Oct 2006. C) Feb 2007. [Plots: x-axis, users; y-axis, number (log scale); series: sum of duration, distinct days, number of sessions, cluster; panel A marks regular users and visitors.]


Figure 2-7. Clustering results for University U1 Fraternities. A) Feb 2006. B) Oct 2006. C) Feb 2007.

Figure 2-8. Clustering results for University U2 Fraternities. A) Nov 2007. B) Apr 2008.


Figure 2-9. Clustering results for University U2 Fraternities. A) Nov 2007. B) Apr 2008.

2.1.1.3 Hybrid Filtering (HF)

As we do not know the ground truth or have the real data about the users, it is

difficult to validate the results of these classifications. In order to have a meaningful

analysis after the classification, we need to validate the classification. We validate LBC

via multiple techniques in Sec. 2.2. In one of the techniques, we compare the results

from IBF and GBF. Results are tabulated in Tab. 2-5. We find that both methods largely

select the same set of users, which should be the case as both methods attempt to identify

regular users (males in fraternities and females in sororities). Therefore, for higher

confidence in the classification and in the analysis in the later sections of the paper, we

choose the users selected by both filtering methods. We call this method Hybrid Filtering

(HF) as it uses results from both IBF and GBF. By doing so we successfully classify the

majority of the users (more than 90% of the users selected by GBF are also

selected by IBF, as shown in Tab. 2-5).
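
In code, Hybrid Filtering amounts to a set intersection. The sketch below uses made-up user identifiers; the function name and the convention of reporting the overlap as a percentage of the GBF selection are assumptions that follow the wording above.

```python
def hybrid_filter(ibf_users, gbf_users):
    """Return the users kept by both filters and the GBF overlap percentage."""
    ibf, gbf = set(ibf_users), set(gbf_users)
    common = ibf & gbf
    # Share of GBF-selected users that IBF also selected, in percent.
    overlap = 100.0 * len(common) / len(gbf) if gbf else 0.0
    return common, overlap


# Made-up selections from the two filters.
common_users, overlap_pct = hybrid_filter({"a", "b", "c", "d"}, {"b", "c", "d", "e"})
```

Keeping only the intersection trades sample size for confidence: a user flagged as regular by one filter but not the other is dropped rather than risked.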

Our proposed scheme of LBC is generic and can classify users into social groups

if these groups have inherent location preferences (sororities are female residences, the

Computer Science major has strong ties with Computer Science buildings, or a theater

group meets often at the auditorium). One thing to note is that LBC and its filtering


techniques do not need access to unanonymized MAC addresses. As long as the MAC

addresses are consistently anonymized, LBC is applicable. This property makes

LBC usable in most of the available WLAN traces. Next, we present Name Based

Classification (NBC) technique, an alternative to LBC.

2.1.2 Name Based Classification (NBC)

In this technique, we use the usernames of the WLAN users, which are sometimes

available in the traces. This field may be obtained on campuses and in enterprises that

require an authorization mechanism, such as passwords, to access the WLAN. Including

username should not affect privacy of the user as these usernames are not private and

usually cannot identify a person. We approach our classification problem by exploiting

the fact that university U2, from which these traces were collected, provides usernames

and maintains a directory. This directory can be searched with the username (WLAN)

and users have the option of not listing their names in the phone book. This implies that

we can search and find the first names corresponding to the usernames for the users

who have made their information available in the phone book. We then use the lists of the

top 1000 male and female first names from the US Social Security Administration

website [2] and remove the names present in both lists (neutral names). Thus, we get

the list of most popular male-only and female-only names. We run this list against the

list of names we find from the phone book, thus finding the gender of the users [32, 66].

With this technique, we do not have the problem of visitors, and thus we do not need any filtering.

We observe that names from the US Social Security list may not be able to classify

foreign national students and users with less popular names into gender groups; this, however, is

not a limitation of our method but of the name database. Using a more comprehensive

database should provide better classification. In this paper, however, we are more

concerned with a general methodology for classifying WLAN users; the details of how to

acquire a better database are outside the scope of the paper.
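
The name-matching step can be sketched as follows. The short name lists are made-up stand-ins for the top-1000 lists, and the function names and case-insensitive matching are assumptions of this example.

```python
def gender_specific_names(male_names, female_names):
    """Drop names that appear in both lists and return the two remaining sets."""
    male = {name.lower() for name in male_names}
    female = {name.lower() for name in female_names}
    neutral = male & female  # names present in both lists
    return male - neutral, female - neutral


def classify_by_name(first_name, male_only, female_only):
    """Return "male", "female", or None when the name is in neither set."""
    name = first_name.strip().lower()
    if name in male_only:
        return "male"
    if name in female_only:
        return "female"
    return None


# Tiny illustrative lists; "Jordan" appears in both and is treated as neutral.
male_only, female_only = gender_specific_names(
    ["James", "John", "Jordan"], ["Mary", "Linda", "Jordan"]
)
results = [classify_by_name(n, male_only, female_only)
           for n in ("John", "Mary", "Jordan", "Wei")]
```

Names absent from both lists are left unclassified rather than guessed, which is why coverage depends on how comprehensive the name database is.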


Table 2-2. Results of classification of users from U1 (LBC) and U2 (NBC). ‘Common’ signifies the users who were common to both the male and female populations.

                          U1-IBF                          U1-GBF                    U2-NBC
                Feb 2006  Oct 2006  Feb 2007    Feb 2006  Oct 2006  Feb 2007    Nov 2007  Apr 2008
Total Users        16416     22405     20302       16416     22405     20302       27068     29982
Males (only)         506       553       545         451       437       417        5245      5807
Females (only)       513       570       509         441       456       410        5955      6817
Common                 0         0         0          22        37        29           0         0

Using NBC, we could classify about 11,200 users as male or female out of roughly

27,000 users in the trace period of Nov 2007, and about 12,600 out of roughly

30,000 users in the trace period of Apr 2008 at University U2. Some of the users from

both trace periods have been marked as ‘Common’ since their names appeared in both the

male and female name lists. For the purpose of this study, ‘Common’ users were excluded

from both the male and female user sets. Details of the classification are listed in Tab. 2-2.

Compared to NBC, LBC requires less information (usernames are not needed); however,

we need to find a way to validate LBC. One way to validate it is to compare the classification

results of LBC with those of NBC, as shown in Sec. 2.2.3. The NBC method is much closer to the

ground truth, but its use is limited because usernames are available in

very few of the currently available traces. Once we check the correctness of LBC, it can

become the primary method for classification.

2.2 Validation of Location Based Classification

Validation of LBC is needed to raise confidence in the results from U1, i.e., that users

classified as visitors are indeed visitors and not regular users of that Access Point

(males in case of fraternities and females in case of sororities). Validation of the results

with the ground truth/actual reality is difficult, especially when we have developed the

methods for publicly available traces and information. Even if we get access to students’

university records, we would not be able to match them with students’ devices (especially

when MAC addresses are anonymized). One approach is to conduct surveys of 50,000

users on each campus; however, the results are likely to be incomplete and noisy (erroneous),

aside from the enormous effort and resources needed, if it is possible at all. Instead, we have

devised three statistical methods to validate our filtering mechanisms. The first method

identifies regular users in the trace sets of adjacent months and compares

the lists to see how many are common (temporal consistency). The second method

compares results from IBF and GBF to check the similarities in the results. The third

method takes the classification achieved using NBC method and compares it with the

results of LBC because NBC should be very close to the ground truth. The methods are

discussed in detail below.

2.2.1 Temporal Consistency Validation Using Adjacent Months

In this method of validation, we consider a pair of one month long trace-sets

belonging to adjacent months in the same semester (such as February 2006 and March

2006 from Spring 2006 semester) and use IBF, GBF, and HF filtering techniques to

find out how many users are common between the two adjacent months before and

after filtering. The assumption is that the set of users living in fraternities and sororities

does not change from one month to another in the same semester. If, after filtering, the

percentage of common users increases then it is likely that this method works correctly

in identifying regular users. Tab. 2-3 and Tab. 2-4 show the results we obtain for both

fraternity and sorority users. We see that for fraternities, before filtering, the percentage

of common MACs in two consecutive months is around 60% to 64% and after filtering

it goes upto between 72% to 80% in all three filtering schemes. In case of sororities,

before filtering, we see that common users are between 66% to 72% and after filtering

the percentage of common users shoots up to 80% to 93%. This shows that filtering

schemes are selecting regular users, as percentage of common users rises dramatically

after filtering.
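
The temporal consistency check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the identifiers are made up, and the percentage is taken relative to the month (a) population, which appears to match the percentages reported in Tab. 2-3 and Tab. 2-4.

```python
def percent_common(users_a, users_b):
    """Percentage of month-(a) users that also appear in month (b)."""
    users_a, users_b = set(users_a), set(users_b)
    if not users_a:
        return 0.0
    return 100.0 * len(users_a & users_b) / len(users_a)


def consistency_gain(raw_a, raw_b, regular_a, regular_b):
    """Overlap before filtering, overlap after filtering, and the difference."""
    before = percent_common(raw_a, raw_b)
    after = percent_common(regular_a, regular_b)
    return before, after, after - before


# Toy example with made-up MAC identifiers (not data from the traces).
raw_feb = {"m1", "m2", "m3", "m4", "m5"}
raw_mar = {"m1", "m2", "m3", "m6", "m7"}
regular_feb = {"m1", "m2", "m3"}  # users kept by a filter such as IBF
regular_mar = {"m1", "m2", "m6"}

before, after, gain = consistency_gain(raw_feb, raw_mar, regular_feb, regular_mar)
```

A positive `gain` is the signal that the validation looks for: the filtered populations of adjacent months overlap more than the unfiltered ones.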

2.2.2 IBF vs GBF

The LBC technique in Sec. 2.1.1 uses two main filtering techniques: IBF and
GBF. Both use location information to identify gender; however, the cut-off thresholds

Table 2-3. Similarity in the user population selected after filtering fraternity users for U1

Before Filtering
Month(a)   Month(b)      # of Users(a)   # of Users(b)   Common   % Users
Feb2006    Mar-Apr2006   1350            1441            816      60.4
Oct2006    Nov2006       1520            1572            969      63.8
Feb2007    Mar-Apr2007   1692            1875            1050     62.1

After Filtering - IBF
Month(a)   Month(b)      Male(a)   Male(b)   Common Males   % Common
Feb2006    Mar-Apr2006   506       507       386            76.2
Oct2006    Nov2006       553       518       401            72.5
Feb2007    Mar-Apr2007   545       613       407            76.5

After Filtering - GBF
Month(a)   Month(b)      Male(a)   Male(b)   Common Males   % Common
Feb2006    Mar-Apr2006   473       463       378            80.0
Oct2006    Nov2006       474       445       371            78.27
Feb2007    Mar-Apr2007   446       482       354            79.4

After Filtering - HF
Month(a)   Month(b)      Male(a)   Male(b)   Common Males   % Common
Feb2006    Mar-Apr2006   416       409       332            79.8
Oct2006    Nov2006       418       387       327            78.2
Feb2007    Mar-Apr2007   399       419       311            77.9

for filtering regular users and visitors are set differently. Comparing the results of both
methods provides us with another validation mechanism. Tab. 2-5 compares the
filtering results for three month-long traces (Feb 2006, Oct 2006, Feb 2007) from university
U1. We can see that more than 400 (75%) users are consistently common to both
methods. This high degree of similarity indicates that both methods remove visitors
and retain a similar set of regular users, which increases the confidence in our results.
We note that GBF is more conservative (it selects fewer regular users) than IBF. This
could be attributed to the fact that GBF, through clustering, compares each user against
the usage attributes (session count, duration, distinct days of login) of an average user,
which can be higher than those of a regular user selected by IBF. For the user behavior
analysis in the following section, we only

Table 2-4. Similarity in the user population selected after filtering sorority users for U1

Before Filtering
Month(a)   Month(b)      # of Users(a)   # of Users(b)   Common   % Users
Feb2006    Mar-Apr2006   991             1155            717      72.3
Oct2006    Nov2006       1264            1305            844      66.8
Feb2007    Mar-Apr2007   1169            1327            821      70.2

After Filtering - IBF
Month(a)   Month(b)      Female(a)   Female(b)   Common Females   % Common
Feb2006    Mar-Apr2006   513         536         450              87.7
Oct2006    Nov2006       570         557         461              80.9
Feb2007    Mar-Apr2007   509         511         417              81.9

After Filtering - GBF
Month(a)   Month(b)      Female(a)   Female(b)   Common Females   % Common
Feb2006    Mar-Apr2006   463         474         429              92.7
Oct2006    Nov2006       493         456         432              87.6
Feb2007    Mar-Apr2007   439         458         405              92.3

After Filtering - HF
Month(a)   Month(b)      Female(a)   Female(b)   Common Females   % Common
Feb2006    Mar-Apr2006   435         449         402              92.4
Oct2006    Nov2006       454         432         401              88.3
Feb2007    Mar-Apr2007   406         413         367              90.4

Table 2-5. Validation - comparing users selected by IBF and GBF for U1

Month      Gender   IBF   GBF   HF
Feb 2006   Male     506   451   416
Feb 2006   Female   513   441   435
Oct 2006   Male     553   437   418
Oct 2006   Female   570   456   454
Feb 2007   Male     545   417   399
Feb 2007   Female   509   410   406

consider the users selected by both filtering methods, a selection we refer to as Hybrid
Filtering (HF).
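
The hybrid selection amounts to a set intersection of the two filter outputs. The sketch below is only an illustration under assumptions: the function names and toy identifiers are hypothetical, and both input sets are assumed to be non-empty.

```python
def hybrid_filter(ibf_users, gbf_users):
    """Return the users selected by both IBF and GBF (the HF set)."""
    return set(ibf_users) & set(gbf_users)


def agreement(ibf_users, gbf_users):
    """Share of each filter's selection that the other filter also selects."""
    hf = hybrid_filter(ibf_users, gbf_users)
    return {
        "hf_size": len(hf),
        "pct_of_ibf": 100.0 * len(hf) / len(set(ibf_users)),
        "pct_of_gbf": 100.0 * len(hf) / len(set(gbf_users)),
    }


# Made-up identifiers, for illustration only.
ibf = {"u1", "u2", "u3", "u4"}
gbf = {"u2", "u3", "u4", "u5", "u6"}
result = agreement(ibf, gbf)
```

Reporting the overlap relative to each filter separately makes it visible when one filter is more conservative than the other.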

2.2.3 Cross Validation

NBC does not classify every user as either male or female (Sec. 2.1.2); however,
the classification it does produce has a low error rate because it relies on statistics from
real data from the US Social Security Office. Using this property of NBC, we can find an
error bound for LBC. Availability of the error percentage can help in realizing

Table 2-6. Cross validation of LBC by NBC for U2

Month      FL     FL ∩ MN   Ef      ML    ML ∩ FN   Em
Nov 2007   1280   74        0.058   334   25        0.074
Apr 2008   1690   123       0.072   349   29        0.083

the error margins for LBC. To calculate the error bounds, the users (from sororities

and fraternities) classified by LBC as females and males are put in sets FL and ML

respectively.

Using NBC, we classify all users from fraternities and sororities, placing females in
set FN and males in set MN, and remove the unclassified users. The unclassified users
are those whose name exists in both the male and female databases or does not appear
in either database. The error in female classification by LBC is given by Ef, where
Ef = |FL ∩ MN| / |FL|, and the error in male classification by LBC is given by Em,
where Em = |ML ∩ FN| / |ML|.

Tab. 2-6 provides results on the cross validation of LBC by NBC. We did the

analysis for trace sets coming from university U2 as it provides usernames along with

the information about AP located in the sororities and fraternities, which allows us to

perform both NBC and LBC. For Apr 2008 traces from university U2, the set FL has 1690

users after doing LBC and Ef is equal to 7.2%. In case of set ML, which has 349 users,

we find that Em is 8.3%. Similarly, in the Nov 2007 traces, both Em and Ef are below
8.3% (7.4% and 5.8%, respectively; see Tab. 2-6). The low error values further increase
our confidence in LBC and validate the classification method.
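
The error computation can be illustrated with a short Python sketch. It is not the code used in this work: the function name is hypothetical, and the identifiers are synthetic sets sized only to reproduce the counts of the Apr 2008 row of Tab. 2-6.

```python
def lbc_error(lbc_set, nbc_opposite_set):
    """Fraction of LBC-labelled users that NBC assigns to the other gender."""
    lbc_set = set(lbc_set)
    if not lbc_set:
        raise ValueError("LBC set is empty")
    return len(lbc_set & set(nbc_opposite_set)) / len(lbc_set)


# Synthetic identifiers sized to match the Apr 2008 counts in Tab. 2-6.
F_L = set(range(1690))                     # users LBC labels female
M_N = set(range(123)) | {"x1", "x2"}       # NBC males; 123 of them are in F_L
M_L = set(range(10_000, 10_349))           # users LBC labels male
F_N = set(range(10_000, 10_029)) | {"y1"}  # NBC females; 29 of them are in M_L

E_f = lbc_error(F_L, M_N)
E_m = lbc_error(M_L, F_N)
```

Users that NBC leaves unclassified simply do not appear in `M_N` or `F_N`, so they never count as errors; this mirrors the removal of unclassified users described above.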

To sum up, we find that our location-based classification, LBC (with three filtering
techniques: IBF, GBF and HF), is supported by three validation techniques. Validation
ensures that the users selected by the filtering are indeed the regular users, which
means selecting females in sororities and males in fraternities. The statistical errors
of the filtering were below 10%, and the confidence was found to be over 90%.

2.3 User Behavior Analysis

Classification of users into social groups is the first step in understanding the

usage differences between the groups. The classification techniques discussed in

Sec. 2.1 take all the WLAN users and divide them into various sets (depending on

the grouping criterion). For the gender based grouping, we have three sets : Male,

Female and Unclassified (grouping could not be determined). These groups can

now be evaluated on multiple metrics depending on the application. In this work we

have considered three generic metrics (not corresponding to any application). We

investigate the spatio-temporal distribution for wireless usage across genders in addition

to vendor preference. The main aim of these metrics is to examine the existence of

differences between the groups. We attempt to identify differences that are statistically

significant and consistent across the multiple traces we have studied. One observation

to make here is that it may not be necessary that such differences hold true in different

campuses or time-periods. However, knowledge of these differences (even existence)

may be important to protocols and services targeted at these groups of users. The three

metrics we use are:

a. WLAN Usage and Gender Spatial Distribution: What are the trends in WLAN usage across different areas (buildings) on campus?

b. Average Online Time (Temporal distribution): Are there trends in the average online times of users, and can differences be identified based on gender and areas (buildings) within the campus?

c. Manufacturer Preferences: Which device vendors do different genders prefer? To what degree does gender affect the choice of vendor?

2.3.1 User Spatial Distribution

One example of a metric is the spatial distribution of the users. This metric can
identify where the classified users spend most of their time (as regular users). For example,
by searching for the female users in the complete trace, we can find the locations they
visit. We refer to these locations as “Area”, since they also represent the major/department

Figure 2-10. Comparison of user distribution across the university U1 campus (in percentage)

housed at that location. Here we only look into major trends among active users. A user is
considered active (regular) at an area by using GBF. Differences in the number of users
between the genders can tell us about the building preferences of each gender. Fig. 2-10
and Fig. 2-11 show the percentage distribution of males and females across various
buildings at universities U1 and U2. At both universities, there are more males than
females in the areas of Economics (by 39% at U1 and 33% at U2), Engineering (by 5% at
U1 and 89% at U2) and Law (by 83% at U1 and 6% at U2). The Law area data for
Feb 2007 is an outlier, as we do not have any male students during that period. Females
outnumber males in the areas of Social Science (by 16% at U1 and 3%
at U2) and Sports (by 41% at U1 and 2% at U2). The trends at U1 and U2 are
opposite for the area of Music (U1 has 40% more females, whereas U2 has 33% more
males). For more details see [47].

The existence of locations that are consistently preferred by one of the two genders
highlights a difference in WLAN usage between the genders. Many of the
trends hold even across the two campuses. We believe this can be beneficial to several
applications, as discussed in Sec. 2.4.

Figure 2-11. Comparison of user distribution across the university U2 campus (in percentage)
[Figure: x-axis Area (Admin, Communication, Economics, Engineering, Law, Music, Social Science, Sports), with Nov 2007 and Apr 2008 shown for each area; y-axis User Population Percentage; series Female and Male.]

2.3.2 Average Duration or Temporal Analysis

Average duration of a session for males and females gives us an understanding of

the extent of WLAN usage at different areas. From Fig. 2-12 and Fig. 2-13, we observe

that males on average have longer sessions than females in most of the areas (on

average by more than 9%, in extreme cases by as much as 200%). On average, male

users tend to stay - as WLAN users - at certain places for longer times than females.

At both universities, we see that females consistently have higher average duration

than males in the areas of Social Science (by 12.8% at U1 and 10% at U2) and Sports
(by 17.2% at U1 and 8% at U2). Males consistently have longer sessions at both
universities in the areas of Engineering (by 76% at U1 and 15.4% at U2) and Music (by
39.9% at U1 and 36.8% at U2). Females at university U1 consistently have a
higher average duration in the area of Communication (by 12%), whereas males have a
higher session duration there at university U2 (by 10%). We also see a clear trend at
university U2 that males have longer sessions in the area of Economics.
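
A per-area comparison of this kind can be sketched as follows. This is a minimal illustration with made-up sessions, not the analysis code of this work; the function names are hypothetical, and expressing the difference relative to the baseline group is an assumption about how the percentages are defined.

```python
from collections import defaultdict


def mean_duration(sessions):
    """Mean session duration in seconds per (area, gender) pair."""
    totals = defaultdict(lambda: [0.0, 0])
    for area, gender, seconds in sessions:
        acc = totals[(area, gender)]
        acc[0] += seconds
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def percent_longer(means, area, group, baseline):
    """By what percentage `group` sessions exceed `baseline` sessions in `area`."""
    g, b = means[(area, group)], means[(area, baseline)]
    return 100.0 * (g - b) / b


# Made-up sessions: (area, gender, duration in seconds).
sessions = [
    ("Engineering", "M", 3000), ("Engineering", "M", 3600),
    ("Engineering", "F", 2000), ("Engineering", "F", 2400),
    ("Sports", "M", 1000), ("Sports", "F", 1200),
]
means = mean_duration(sessions)
eng_gap = percent_longer(means, "Engineering", "M", "F")
sports_gap = percent_longer(means, "Sports", "F", "M")
```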

Another observation of interest is that average duration per session decreases from

Feb 2006 to Feb 2007 (from 2789 sec to 2454 sec) in almost all the cases for university

Figure 2-12. Average duration of males and females in different areas of the university U1 campus
[Figure: x-axis Area (Admin, Communication, Economics, Engineering, Law, Music, Social Science, Sports); y-axis Average Duration (seconds), 0 to 5000; series Male and Female for Feb 2006, Oct 2006 and Feb 2007.]

U1 campus. We observe a similar trend at university U2 (from 3800 sec in Nov 2007 to 3609
sec in Apr 2008). This points to the possibility that students are becoming more mobile
and thus have shorter sessions at the same location.

While in some cases the trends were equal across genders, in several scenarios
we do find differences in WLAN usage between the genders. Some of these
differences were found to be significant and spatio-temporally consistent even across
campuses: females’ wireless activity is stronger in the Social Science and Sports areas,
whereas males’ activity is stronger in Engineering and Music. In other scenarios, each
university campus had a trend specific to it. These findings are likely to have a
significant impact on usage modeling in wireless networks.

2.3.3 Device Preference

In many available traces, partial MAC anonymization is done, such that top three

octets of the address (which identify the Manufacturer) are left unchanged. Traces

from both U1 and U2 use partial anonymization. These top octets can be used to

Figure 2-13. Average duration of males and females in different areas of the university U2 campus
[Figure: x-axis Area (Admin, Communication, Economics, Engineering, Law, Music, Social Science, Sports); y-axis Average Duration (seconds), 0 to 5000; series Male and Female for Nov 2007 and Apr 2008.]

find preferred vendors for the groups (Male and Female). In this metric, we are only

considering major vendors (by the number of users).
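
As a rough sketch of how the retained octets could be turned into per-group vendor shares, consider the Python fragment below. Everything in it is illustrative: the prefix-to-vendor map is a hypothetical placeholder (real assignments are published in the IEEE OUI registry), the MAC strings are made up, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical prefix-to-vendor map; these are NOT real OUI assignments.
OUI_TO_VENDOR = {"00:11:22": "VendorA", "11:22:33": "VendorB"}


def oui(mac):
    """Return the first three octets of a (possibly partially anonymized) MAC."""
    return ":".join(mac.split(":")[:3]).lower()


def vendor_share(macs_by_gender):
    """Percentage of distinct devices per vendor, for each gender group."""
    shares = {}
    for gender, macs in macs_by_gender.items():
        counts = Counter(OUI_TO_VENDOR.get(oui(m), "Other") for m in set(macs))
        total = sum(counts.values())
        shares[gender] = {v: 100.0 * n / total for v, n in counts.items()}
    return shares


# Made-up partially anonymized MACs (vendor octets kept, remainder replaced).
shares = vendor_share({
    "Male": ["00:11:22:0353", "00:11:22:0400", "11:22:33:0521", "aa:bb:cc:0001"],
    "Female": ["11:22:33:0600", "11:22:33:0601"],
})
```

Counting each distinct MAC once, rather than each session, keeps heavy users from dominating the vendor shares.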

Fig. 2-14 and Fig. 2-15 show the number of users per vendor at University U1 and

U2. At university U1, it is interesting to note that Apple computers are more popular

amongst females than males. Intel devices are more popular amongst males. For

example, using the Feb 2006 traces we find that 25% of the males use Apple and 32%

use Intel, so that there are 28% more male users using Intel with respect to Apple users.

In the case of Females, 30% use Apple and 27% use Intel, so 12% more female users

use Apple than Intel. To test whether gender provides a bias towards specific vendors,

we use the Chi-Square statistical significance test. The Chi-Square test shows with

90% confidence that there is a bias between gender and vendor/brand. This holds true

for all the three trace sets from university U1. We also notice a consistent increase in

percentage of Apple computer users of both genders over the three trace samples.
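
The independence test can be illustrated on a 2 x 2 gender-by-vendor table. The sketch below computes Pearson's chi-square statistic in plain Python; the counts are hypothetical (1000 users per gender, scaled from the Feb 2006 percentages quoted above) and are not the counts from the traces, so the resulting statistic is illustrative only. The critical value 2.706 is the standard chi-square threshold for one degree of freedom at the 90% confidence level.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts per 1000 users of each gender (illustration only).
observed = [
    [250, 320],  # male: Apple, Intel
    [300, 270],  # female: Apple, Intel
]
CRITICAL_90_DF1 = 2.706  # chi-square critical value, 1 degree of freedom, alpha = 0.10

stat, dof = chi_square(observed)
dependent = dof == 1 and stat > CRITICAL_90_DF1
```

With real data, the table would hold the observed user counts per gender and vendor, and tables with more vendors would need the critical value for the corresponding degrees of freedom.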

For comparison of the results from university U1 with university U2, for this case

only, we considered users only from fraternities and sororities from university U2.

Figure 2-14. Device distribution by manufacturer at university U1
[Figure: x-axis Manufacturer (Apple, Intel, Askey, Gemtek, Netgear, Hon Hai, Enterasys, D-Link, Linksys); y-axis User Percentage, 0 to 40; series Male and Female for Feb 2006, Oct 2006 and Feb 2007.]

The classification of users was performed using LBC (similar to university U1). At
university U2, we do not find trends similar to those at university U1: both
genders consistently prefer Intel devices over Apple devices. We tend to
believe that the preferences of WLAN users can vary with geographic location and with
factors such as the affluence of the local population and the presence of an Apple store
on campus, among others.

We also observe that vendors like Enterasys, Linksys, D-link and Askey Corp.

have a decreasing trend in terms of percentage of users. One of the reasons is that

these manufacturers mostly make external Wi-Fi devices for old laptops (with no built-in

Wi-Fi NICs). Currently almost all new laptops come with a built-in Wi-Fi, so the users of

external devices are decreasing.

These results indicate once more that there are statistically significant differences in
the usage patterns of the two genders. One possible implication of this device preference
is that the propagation of PC viruses or malware may be less effective in some female
groups, which would have a direct impact on security studies in future wireless societies
such as DTN [77].

Figure 2-15. Device distribution by manufacturer at university U2
[Figure: x-axis Manufacturer (Apple, Intel, Askey, Gemtek, Netgear, Hon Hai, ASUSTek); y-axis User Percentage, 0 to 50; series Male and Female for Nov 2007 and Apr 2008.]

2.4 Applications

Analysis of user behavior in the previous section highlights that statistically

significant differences exist in the usage pattern of the two genders. There can

be several metrics on which a group of users can be evaluated and their behavior

quantified. The results from these metrics can then be applied to an existing or new

application to make it context sensitive. In this section, we discuss a few applications
that will benefit from the quantified differences among the groups, such as mobility
modeling and protocol design. We also discuss the impact of this analysis on user privacy,
wireless network deployment, and resource management, among others. For lack of
space, further details of the applications are omitted.

2.4.1 Mobility Models

Mobility models are important tools to understand user movements and create

models on which protocols can be tested. The knowledge of groups can be used

to re-evaluate mobility models such as TVC [37], IMPORTANT [16], and several

others [15]. This enhancement can allow us to model/evaluate social groups on

‘behavioral’ aspects, load (sessions duration) and density among others. This kind

of study is practical only with methods such as those presented in this work;
alternatives such as surveying 50,000 users would require tremendous effort and may
still have similar error rates.

2.4.2 Protocol Design

Protocol and service design in Mobile Ad-Hoc Networks can use the features of
various groups to evaluate performance. Profile-Cast [39] has shown that, by
considering the behavior of users (profiles), one can create efficient protocols for Mobile
Ad-Hoc Networks. That work, however, does not consider differences among groups of
people. It has also been shown that users with similarities meet often and have closer
ties [60]. Do similar people (belonging to the same group) have a higher chance of meeting
more often? Can this knowledge increase message delivery success? Our method helps
in identifying the social groups; however, further investigation is needed, such as
combining this group information with services such as Profile-Cast.

2.4.3 Privacy

A major impact of this work is that it brings the privacy issues of traces
to the forefront. Determining gender from anonymized traces exposes
weaknesses in current anonymization techniques. It may be argued that anonymizing
location information would prevent this kind of classification; however, doing so
decreases the utility of the traces, and the authors in [48] show that location
anonymization can be easily undone. The primary reason is the unique session patterns
of WLAN users. Anonymizing WLAN traces while maintaining their utility
is a challenging task, and our work points at this significant problem.

2.4.4 Resource Management

Knowledge of group behavior can also be helpful in planning WLAN resource

deployment and capacity modeling. Questions like how the usage would change

if more admissions are given to computer science students versus law students or

females/males can now be answered in better light.

We all have an intuition about where and how a certain group of users may use the WLAN;
our method allows us to quantify this intuition. We believe that the methods discussed in
this work are a fundamental step for many interesting studies in the future.

2.5 Conclusion And Future Work

In this study, we propose novel methods that use WLAN traces to classify WLAN
users into social groups based on features such as gender and study-major, among

others. The work presents a general framework that can be applied to traces coming

from multiple sources. As an example, traces from two university campuses have

been used and gender based grouping classification is performed. Multiple techniques

for grouping users are discussed since each one has slight advantages in certain

scenarios. The study cross-validates the results by comparing results provided by each

of the classification methods.

Results from this research are based on a sample of the user population, since

gender may be identified based on sorority and fraternity wireless access point

associations or based on name filter. We find that there is a distinct difference in WLAN

usage patterns for different genders even with similar population sizes. Availability of

results comparing groups of users can allow researchers to quantify the behavioral

differences between the groups. We see that these trends and characteristics are

consistent over periods of time and across different semesters and sometimes even

across university campuses. We also see some trends, such as vendor preference, that are
not consistent across the two university campuses. At one university, females show
a statistical preference for Apple computers; however, no similar
observation is made at the other university. We think that some social characteristics
depend on the location of the university campus and the facilities around it
(such as the presence of an Apple store or an affluent population). Even though the
results vary with time and location, it may be essential for a protocol designer of mobile
networks to understand the characteristics of this network.

Interestingly, we were able to classify users into males and females and were

successful in obtaining their preference of vendor, based on analysis of anonymized

traces (university U1 study did not use usernames). We were also able to validate our

results. This raises several privacy issues. Can private information of individuals be

identified by analyzing anonymized traces? What kind of anonymization algorithms

should be used for mobile networks traces? And how can such algorithms provide a

notion of k-anonymity [76] for the mobile society while retaining useful information for

researchers? These are questions that bear further research and we plan to address

them in our future work.

In the future, we plan to prepare mathematical models, which can represent a user

in a particular group. This process would allow us to understand various features, which

represent the user’s WLAN usage characteristics. It would also allow us to classify users

into groups by looking at the features only. User model would also be useful in tailoring

the protocols for multicast and profile-cast to incorporate the group behavior.

We hope for this study to open the door for other mobile social networking studies

and profile-based service designs based on sensing the human societies.

CHAPTER 3
BREAKING ANONYMITY IN WLAN TRACES

The advent of portable/mobile devices and availability of ubiquitous network

coverage using heterogeneous wireless technologies like Wi-Fi (IEEE 802.11), GPRS,

3G and Wi-Max, has allowed humans to browse information on the go. From sharing a

computing device at home, office, or a commercial establishment, we have come to an

era where these devices have become very personal and customized to user’s taste.

A major impact of this change (apart from all the benefits of being mobile) is that these

devices have become sensors of the human society. As these devices remain with

their owners for many hours in a day, they can capture large amounts of user behavior

patterns, which can be made available to researchers. On the one hand, the study of such
data can be used to develop a better understanding of human behavior and provide
improved services; on the other hand, the availability of this kind of data can be
considered an infringement on the privacy of the user.

Several researchers use WLAN traces for research and analysis purposes, such as
examining the usage behavior of users [35, 38, 49], discovering characteristics for
developing network protocols [39], or studying user mobility patterns [25, 37, 65]. Many
of the WLAN traces are publicly available [7, 41]. It is, therefore, important to understand
how the privacy of WLAN users is affected. In this work, we investigate the extent of users’
private information that can be extracted from anonymized Wireless Local Area
Network (WLAN) traces. Even though most trace libraries anonymize/sanitize
the traces to protect users’ privacy, we present several methods that can be used to
reverse the anonymization. We attempt to expose the weaknesses in the currently used
anonymization techniques and draw the attention of the WLAN research community to this
fundamental problem. We find that WLAN traces are unique in the sense that human

movement pattern gets embedded in them, which can have unique signatures. These

signatures can be later combined with publicly available information from such sources

as directories or schedules to identify a user even after anonymization. Despite the

importance of privacy issues in WLAN traces, there is a lack of significant research in

this field. The purpose of this study, therefore, is to shed light on the need of better

anonymization techniques and identify a rich set of plausible scenarios in which

anonymity can be compromised.

The issues of privacy and anonymization have always been present in network
traces. Researchers have also faced challenges in anonymizing wired traces [69].
Recently, wireless traces have also been collected and archived at on-line public
libraries like CRAWDAD [7] and MobiLib [41], which collectively hold well over 50 traces.
As these traces contain pervasively captured user information, several questions have
been raised about the process of collecting them [12, 74]. Techniques are being
researched that let users share their traces themselves [73]. However, the pertinent
question that remains unanswered is how traces, once collected, can be
prepared for distribution so that they retain good utility without
compromising the privacy of the users. Our efforts are targeted at this question, which has
become even more challenging with WLAN traces, as we shall discuss.
In this work, we present our analysis of the currently used anonymization methods and
their shortcomings.

The next section presents the information available in the WLAN traces. Sec.

3.2 presents example scenarios where identifying a user and monitoring his usage

pattern can be detrimental to his privacy. These cases justify the need for fail-proof

anonymizing/sanitizing of WLAN traces. We discuss prevalent methods of anonymizing

WLAN traces in Sec. 3.3, following which we discuss attack scenarios and methods,

which can be used to break WLAN anonymization. Sec. 3.4 presents an analysis of how

the anonymization could be broken. Sec. 3.5 provides an analysis of the attacks and
discusses different possible approaches that can be used to prevent invasion of privacy,

though this remains an open question. In the last section, we summarize our findings

and present directions for future research.

3.1 Information In WLAN Traces

WLAN traces are logs of user association with wireless Access Points (AP). A

generic information tuple, after some processing of the raw trace, has MAC ID, Start

time, Duration and Access Point/Location.

Table 3-1. WLAN trace sample: before and after anonymization

MAC                 Start Time                 Duration (sec)   AP/Location
00:11:22:33:44:55   01 Jun 2008 21:00:51 GMT   3000             CS building AP1
11:22:33:44:55:66   01 Jun 2008 21:01:30 GMT   10               ECE building AP2
01:02:03:04:05:06   01 Jun 2008 22:11:00 GMT   200              MSL building AP1
10:20:30:40:50:60   01 Jun 2008 22:15:30 GMT   600              MACA building AP1
11:22:33:44:55:66   01 Jun 2008 22:23:10 GMT   180              CS building AP3

a. Sample un-anonymized trace

Anonymization applied to each field:
MAC: partial and consistent | Start Time: no change | Duration: no change | AP/Location: location anonymization

MAC             Start Time                 Duration (sec)   AP/Location
00:11:22:0353   01 Jun 2008 21:00:51 GMT   3000             AcadBldg10AP1
11:22:33:0521   01 Jun 2008 21:01:30 GMT   10               AcadBldg2AP2
01:02:03:9877   01 Jun 2008 22:11:00 GMT   200              Library5AP1
10:20:30:3260   01 Jun 2008 22:15:30 GMT   600              AcadBldg22AP1
11:22:33:0521   01 Jun 2008 22:23:10 GMT   180              AcadBldg10AP3

b. Sample anonymized trace

A snapshot from an un-anonymized trace is shown in Tab. 3-1a. Some traces
may provide more information, such as usernames. For the sake of simplicity, we
consider the basic tuple shown in Tab. 3-1. Using a tuple with less
information does not make the breaking of anonymity any easier, as compromising
anonymity with less information is more difficult.
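
To make the tuple and the partial, consistent anonymization of Tab. 3-1 concrete, here is a minimal Python sketch. It is an assumption-laden illustration rather than the scheme used by any trace library: the `Session` type, the keyed-hash pseudonym, the key, and the location map are all hypothetical, and a 4-digit tag can collide for large populations.

```python
import hashlib
import hmac
from typing import NamedTuple


class Session(NamedTuple):
    mac: str
    start: str
    duration_s: int
    location: str


def anonymize_mac(mac, key):
    """Keep the three vendor octets; replace the rest with a keyed 4-digit tag."""
    octets = mac.lower().split(":")
    prefix = ":".join(octets[:3])
    digest = hmac.new(key, mac.lower().encode(), hashlib.sha256).hexdigest()
    return f"{prefix}:{int(digest, 16) % 10000:04d}"


def anonymize(session, key, location_map):
    """Anonymize MAC and location; start time and duration are left unchanged."""
    return session._replace(mac=anonymize_mac(session.mac, key),
                            location=location_map[session.location])


KEY = b"hypothetical-secret"  # placeholder key, for illustration only
LOCATIONS = {"ECE building AP2": "AcadBldg2AP2", "CS building AP3": "AcadBldg10AP3"}

rows = [
    Session("11:22:33:44:55:66", "01 Jun 2008 21:01:30 GMT", 10, "ECE building AP2"),
    Session("11:22:33:44:55:66", "01 Jun 2008 22:23:10 GMT", 180, "CS building AP3"),
]
anon_rows = [anonymize(r, KEY, LOCATIONS) for r in rows]
```

Because the mapping is consistent, both sessions of the same device receive the same pseudonym, so the session pattern of that device survives anonymization intact.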

3.2 Need For Anonymity

Although the implications of losing privacy in the real world are well known, in this

section, we discuss the implications related to the loss of privacy in WLAN traces. As

Tab. 3-1 shows, the MAC address is one of the fields in the traces. This field is the link-layer

address of the hardware/device used to access the WLAN network. Users generally do

not change their MAC addresses between the sessions (perhaps due to lack of tools,

which do it effortlessly or due to lack of awareness) and current protocols do not allow

a user to change his MAC address during the session. This implies that MAC address

becomes a permanent identifier of the machine. Since most of the machines using

wireless are portable, they are less frequently shared by people. MAC address, thus,

becomes associated to the person and hence his/her identifier. If we know MAC address

of a device and its user, then we can search for that user in the WLAN traces and

essentially know the places visited. MAC address of a device can be found by various

methods such as sniffing the wireless channel.
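To make the threat concrete, the lookup described above can be sketched in a few lines of Python. This is only an illustration and not part of the original study: the `TraceRecord` type and the `places_visited` helper are hypothetical names, and the sample rows are taken from the un-anonymized sample in Table 3-1.

```python
from datetime import datetime
from typing import NamedTuple


class TraceRecord(NamedTuple):
    """One association record (hypothetical type mirroring the basic tuple)."""
    mac: str          # link-layer address of the device
    start: datetime   # session start time
    duration_s: int   # session length in seconds
    location: str     # access point / building label


def places_visited(trace, mac):
    """Return (start, location) pairs for one MAC address, in time order."""
    sessions = sorted((r for r in trace if r.mac == mac), key=lambda r: r.start)
    return [(r.start, r.location) for r in sessions]


# Rows copied from the un-anonymized sample in Table 3-1.
trace = [
    TraceRecord("00:11:22:33:44:55", datetime(2008, 6, 1, 21, 0, 51), 3000, "CS buildingAP1"),
    TraceRecord("11:22:33:44:55:66", datetime(2008, 6, 1, 22, 23, 10), 180, "CS buildingAP3"),
    TraceRecord("11:22:33:44:55:66", datetime(2008, 6, 1, 21, 1, 30), 10, "ECE buildingAP2"),
]

for start, location in places_visited(trace, "11:22:33:44:55:66"):
    print(start.isoformat(), location)
```

Anyone who holds both the trace and a MAC-to-person mapping can run such a query, which is why the MAC field is the first target of anonymization.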

Greenstein et al.[34], with the help of case studies, have shown how capturing and analyzing 802.11 protocol packets can be used to invade user privacy. The cases we present show threats similar to those in [34]; however, we use only the WLAN traces and do not couple them with actively captured data packets. In our case, the threats become even more serious because the attacker need not be present in the same geographic location as the victim (traces are available on the Internet [7, 41]). Tracking the attacker can also be difficult because some of the WLAN traces are publicly available with few or no security checks or logging mechanisms. Below are some cases that show possible attacks on user privacy:

1. One can prove someone's presence at a location by showing the association of his machine with an AP located in that vicinity.

2. If one knows the MAC-to-name mapping of a user, he/she can trace the user by finding the location of the AP with which the user associates. Therefore, he/she can obtain the user's daily activity pattern/schedule (imagine a thief knowing exactly when one is going to be away from the house, or in which time interval nobody is in the office).

3. By looking at the MAC addresses associated with a particular AP with which a user associates, one can guess which people the user is meeting. If a MAC-address-to-name mapping is available for all MACs, this would be a trivial task.

4. The information can be used as forensic evidence against the user (or as an alibi).

Table 3-2. Fields present in each record of a wired trace (essentially an IP header)

Version          Header Length         Type of Service
Identification   Flags                 Fragment Offset
Time to Live     Protocol              Header Checksum
Source Address   Destination Address   Options
Data

These scenarios show some of the possible privacy infringements if WLAN traces are made available without anonymization. Trace providers are aware of these concerns and therefore anonymize the traces before making them public. In this study, however, we show that the anonymization techniques used can be compromised and that users can be identified to some extent even after anonymization. The next section provides an in-depth discussion of the anonymization techniques used for WLAN traces; this allows us to better appreciate the attack as well as the complexities involved in anonymizing the traces.

3.3 Related Work

Wired network traces have existed for some time, and many libraries have been created for sharing them[4, 6]. Researchers have developed several anonymization techniques[63, 69, 81] for wired network traces, along with tools such as Tcpmkpub[69] and Tcpdpriv[62]. We looked into these traces and techniques to investigate whether they can be applied to WLAN traces. We found, however, that even the highly sophisticated techniques proposed to anonymize wired traces are not completely unbreakable[27]. In addition, there are fundamental differences between wired and wireless LAN (WLAN) traces, which make it difficult to apply wired-trace anonymization to WLAN traces. In terms of anonymization goals, in wired traces the goal is to prevent discovery of the identities of network resources and leakage of security policies [69]; anonymization of WLAN traces, however, also requires protection of the user's identity[34, 68], as the network resources are personal devices. Wired traces (also called netflow) have the fields shown in Table 3-2, which is essentially an IP header (IPv4). WLAN traces can have this information along with other information, as in Table 3-1a, which is generated by association and disassociation of the device with the access points (APs). As this feature is unique to WLAN usage, we face newer challenges in anonymization. A complete WLAN trace (association records along with netflow) is thus a superset of a wired trace (netflow only). In WLANs, IP addresses are generally assigned using DHCP, and the subnet varies with the WLAN access location. This reduces the chance of the same machine getting the same address in every session, which in wired traces can be considered 100% (assuming static assignments only). This makes anonymization of the netflow information in WLAN traces much simpler than in wired traces.

We find that in many studies[25, 35, 36, 39, 49] of WLAN traces, researchers have used only association traces such as those shown in Table 3-1. In fact, most of the WLAN trace libraries[7, 41] do not have netflow traces as comprehensive as their association traces. One reason is that association traces and netflow data serve different purposes in WLAN research. Netflow information (as in wired traces) is usually used to understand the behavior of applications[44, 64], to detect anomalies in the network[5, 82], for network protocol design, and for network planning [28, 75]. Wireless traces have been used for network planning[25, 35], understanding user behavior[25, 36, 49], DTN protocol design[39], and understanding societal interaction with technology [49].

Overall, we see that even though a rich set of techniques is available for wired traces, they are insufficient for WLAN traces for the reasons above. For the same reasons, the attacks on WLAN traces would be quite different from the attacks on wired traces.

Although anonymization is a very important step in releasing WLAN traces, we could not find any published work that deals with the techniques most suitable for WLAN. Most of the techniques in use have not been thoroughly investigated in the light of WLAN traces. This will become clearer in the next section, where we discuss the possible attacks and the drawbacks of the existing methods. The rest of this section examines the anonymization techniques currently applied to WLAN traces.

Current Techniques: Anonymization of WLAN traces is done on a field-by-field basis[41, 46]. Either a field is fully anonymized (mapped to a random number) or only a portion of the field is anonymized. In traces having multiple sessions per MAC address, trace providers can either randomize the MAC address to a unique value for each session or use the same anonymization mapping of the MAC address for all the sessions (consistent mapping). This choice determines the information content and utility of the traces. Consistent mapping of each MAC throughout the traces provides the ability to track a user through multiple sessions. The majority of the traces available at MobiLib[41] and Crawdad[7] provide consistent mappings.

Some traces, such as the Dartmouth traces[46] at Crawdad[7], anonymize the location field by giving only building-level granularity for the AP's location or by replacing the building name with a code name such as AcadBldg10AP3[46], which signifies an AP (numbered 3) located in a building used for academic purposes. In this case, all the buildings are grouped into building classes such as acadbldg, librarybldg, etc. Tab. 3-1b shows how WLAN traces look after consistent, partial MAC anonymization with reduced location information. We will attempt to extract private information from traces anonymized using this technique, as it is used by many trace providers[46].
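The field-by-field scheme just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular trace provider: the `TraceAnonymizer` class is hypothetical, the choice to keep the first three octets and replace the rest with a four-digit number simply mirrors the sample in Table 3-1b, and buildings are numbered in order of first appearance for brevity.

```python
import random


class TraceAnonymizer:
    """Consistent, partial MAC anonymization plus building-class location codes."""

    def __init__(self, seed=None):
        rng = random.Random(seed)
        # Shuffled pool of unused 4-digit suffixes (illustrative limit: 10,000 devices).
        self._pool = rng.sample(range(10_000), 10_000)
        self._mac_suffix = {}      # real MAC -> pseudonym suffix
        self._building_code = {}   # real building -> code such as "AcadBldg1"
        self._class_count = {}     # building class -> buildings seen so far

    def anonymize_mac(self, mac):
        # Consistent mapping: the same real MAC always gets the same suffix.
        if mac not in self._mac_suffix:
            self._mac_suffix[mac] = f"{self._pool.pop():04d}"
        # Partial mapping: the first three octets are left untouched.
        return f"{mac[:8]}:{self._mac_suffix[mac]}"

    def anonymize_location(self, building, building_class, ap_number):
        # Replace the building name with "<class><n>" and keep the AP number.
        if building not in self._building_code:
            n = self._class_count.get(building_class, 0) + 1
            self._class_count[building_class] = n
            self._building_code[building] = f"{building_class}{n}"
        return f"{self._building_code[building]}AP{ap_number}"


anonymizer = TraceAnonymizer(seed=7)
print(anonymizer.anonymize_mac("11:22:33:44:55:66"),
      anonymizer.anonymize_location("CS building", "AcadBldg", 3))
```

Because the mapping is consistent, every session of a device carries the same pseudonym, which preserves the utility of the trace and is also what the attacks in the next section exploit.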


Figure 3-1. Attacker capabilities

3.4 Attack Scenarios

In this section, we present techniques by which user privacy can, in theory, be compromised. Fig. 3-1 shows attacker capabilities in terms of the information about the trace collection environment that he can access. The attacker is assumed to have access to the anonymized traces in all scenarios. In this work, however, we do not deal with all possible scenarios, as our aim is to bring out the shortcomings of current anonymization, which is achieved if we can break the anonymization in even one case. We consider two attack scenarios: one in which the attacker can inject data into the traces by accessing the WLAN (Sec. 3.4.1, 3.4.2 and 3.4.3), and a second in which the attacker has physical access to the campus but cannot access the WLAN (Sec. 3.4.4). If we can identify the anonymized MAC address of any user in the traces, we consider anonymity to have been compromised. This is justified because the main purpose of anonymization is to prevent user identification. Using this definition of compromise, we show how an attacker can identify his own anonymized MAC address and then how he can identify any other user's MAC address.


3.4.1 Identify Your Own MAC In Trace

Under this definition of anonymity compromise, an attacker identifying even his own MAC address should be considered a failure of the anonymization technique. Although this is not a serious breach of privacy per se, the attacker can use this information to find out building codes and to identify the MAC addresses of other users. The steps for obtaining one's own MAC address are as follows:

1. Go to a WLAN-covered area on campus at a time when it is not frequently visited and WLAN usage is at a minimum (find this pattern from previous traces).

2. Associate with an AP belonging to the campus network, and mark the start time and end time.

3. If there are some people around the area, move to a new location at least 100 ft away (beyond the range of the previous wireless AP) and repeat Step 2.

4. Go back to the traces and find all the (anonymized) MAC addresses that log in and log out at the same times at the two locations visited.

5. If there are several such MAC addresses, repeat Steps 1 to 4 and take the intersection of the candidate MAC sets. In the end, only one MAC address should remain after the intersection.

This provides the mapping of one's own MAC address in the traces. In Sec. 3.5, we show mathematically that even in a large environment (over 500 APs), at most 5 iterations of Steps 1 to 4 would be enough to identify one's own MAC address.
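The trace-side part of the procedure (Steps 4 and 5) amounts to a set intersection over the attacker's noted sessions. The following is a rough sketch only: the helper names, the 30-second tolerance and the sample rows (including one invented coincident session) are assumptions made for illustration, and matching is done on times alone because the attacker does not yet know the building codes.

```python
from datetime import datetime, timedelta


def candidates(trace, start, end, tolerance=timedelta(seconds=30)):
    """Anonymized MACs with a session whose start and end match the noted times."""
    found = set()
    for mac, s_start, duration_s, _location in trace:
        s_end = s_start + timedelta(seconds=duration_s)
        if abs(s_start - start) <= tolerance and abs(s_end - end) <= tolerance:
            found.add(mac)
    return found


def identify_own_mac(trace, notes):
    """Intersect the candidate sets over all noted (start, end) sessions."""
    remaining = None
    for start, end in notes:
        found = candidates(trace, start, end)
        remaining = found if remaining is None else remaining & found
    return remaining or set()


# Anonymized rows in the style of Table 3-1b; the third row is a made-up
# session of another device that happens to coincide with the first visit.
trace = [
    ("00:11:22:0353", datetime(2008, 6, 1, 21, 0, 51), 3000, "AcadBldg10AP1"),
    ("11:22:33:0521", datetime(2008, 6, 1, 21, 1, 30), 10, "AcadBldg2AP2"),
    ("01:02:03:9877", datetime(2008, 6, 1, 21, 1, 25), 15, "Library5AP1"),
    ("11:22:33:0521", datetime(2008, 6, 1, 22, 23, 10), 180, "AcadBldg10AP3"),
]
notes = [
    (datetime(2008, 6, 1, 21, 1, 30), datetime(2008, 6, 1, 21, 1, 40)),
    (datetime(2008, 6, 1, 22, 23, 10), datetime(2008, 6, 1, 22, 26, 10)),
]
print(identify_own_mac(trace, notes))
```

Each additional noted session can only shrink the candidate set, which is why a handful of repetitions is expected to suffice.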

3.4.2 Identifying Building Codes

Identifying the building codes is useful for finding users at a particular location. An attacker who knows his anonymized MAC address can visit all the buildings of interest on campus and note his login and logout times at each building. Looking back at the trace, he can then map every anonymized building code to the actual building name by correlating the times in his notes with the sessions recorded for his MAC address.
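A minimal sketch of this correlation step, assuming the attacker's notes are (building name, login time) pairs and that a 60-second tolerance is acceptable; the function name, the tolerance and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def map_building_codes(trace, own_mac, notes, tolerance=timedelta(seconds=60)):
    """Map anonymized location codes to real building names.

    trace: iterable of (anonymized MAC, start, duration in seconds, location code)
    notes: iterable of (real building name, noted login time)
    """
    mapping = {}
    for mac, start, _duration_s, code in trace:
        if mac != own_mac:
            continue  # only the attacker's own sessions are informative
        for building, login in notes:
            if abs(start - login) <= tolerance:
                mapping[code] = building
    return mapping


trace = [
    ("11:22:33:0521", datetime(2008, 6, 1, 21, 1, 30), 10, "AcadBldg2AP2"),
    ("00:11:22:0353", datetime(2008, 6, 1, 21, 0, 51), 3000, "AcadBldg10AP1"),
    ("11:22:33:0521", datetime(2008, 6, 1, 22, 23, 10), 180, "AcadBldg10AP3"),
]
notes = [
    ("ECE building", datetime(2008, 6, 1, 21, 1, 25)),
    ("CS building", datetime(2008, 6, 1, 22, 23, 0)),
]
print(map_building_codes(trace, "11:22:33:0521", notes))
```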


3.4.3 Identifying A Person

Once the building codes are known, one can target a specific person, follow him, and note his device's start or end times (for example, by observing the opening and closing of a laptop lid). Filtering the traces with this approximate timing and building information should leave only a few sessions. If too many remain, one can repeat the process and narrow the candidates down to a single MAC address belonging to the target (publicly available schedules and status messages on social networking websites can also be used to find approximate login and logout times).

To discover the mapping of a large number of anonymized MAC addresses to their real MAC addresses, one can sniff all the wireless traffic at a location (AP) whose trace mapping is already known and parse the captured data for messages that clearly show a machine trying to associate with the AP [68]. In this case, we have the precise time of the user's login as well as the real MAC address and the location, so identifying the anonymized MAC should be trivial. Once the mapping to the real MAC address is known, one can track that person anywhere on the campus.

Using the above methods, an attacker can in theory track any person throughout the campus, causing a breach of privacy. This exposes a serious shortcoming of the prevalent anonymization methods: a privacy attack is possible without much effort. If one does not have access to the campus Wi-Fi, one can ask a friend, or use social skills to ask a complete stranger, to do it. We also observe that even if the trace providers do not release traces on a daily basis, a careful planner can undertake several such experiments, wait for the trace provider to release the trace, and then perform his attack.

3.4.4 Multiple Filtering

In the methods described above, the attacker must be able to inject data into the trace collection system (i.e., must be authorized to access the WLAN). In the current case, we consider an attacker with no ability to access (and inject data into) the WLAN. He is limited to physical access to the trace collection environment. Researchers have attempted to classify WLAN users by gender[49]. We extend this idea by grouping users into categories such as gender, login time, building, and device manufacturer. We then attempt to identify users who appear in multiple categories (i.e., we find the intersection of the groups). Within each individual category, the group size is large (∼100). However, when we intersect the groups, the size drops rapidly. For example, the filter "female student going to the Law building in the morning with an Apple computer" resulted in a single user. This finding has privacy implications. Taking the above example, just watching a female student walk into a law school building with an Apple device in hand should enable an attacker to find the student's anonymized MAC ID in the traces. Once this is accomplished, the attacker can trace the student's movements throughout the campus. This is a serious breach of privacy. We analyzed how many users can be identified by a filter combining gender, study major and network card manufacturer (on a feb2006 trace downloaded from MobiLib[41]). We found that of 111 different filters (formed by different combinations of gender, study major and manufacturer), 35% resulted in a single user and 60% resulted in fewer than 3 users (Fig. 3-2). We repeated the analysis for three different trace periods (feb2006, oct2006, feb2007) and found similar results. We also used different filters, such as gender-major-time, and again obtained similar results. This method exposes a major flaw in the anonymization technique.
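The filtering idea can be illustrated with plain set operations. The sketch below is hypothetical throughout: the attribute table is invented, and how each attribute is inferred from a trace is outside its scope. It only shows how quickly group sizes shrink once categories are intersected.

```python
from itertools import product

# Invented attributes that an attacker is assumed to have inferred per anonymized MAC.
users = {
    "00:11:22:0353": {"gender": "F", "major": "Law", "maker": "Apple"},
    "11:22:33:0521": {"gender": "M", "major": "Law", "maker": "Apple"},
    "01:02:03:9877": {"gender": "M", "major": "Law", "maker": "Dell"},
    "10:20:30:3260": {"gender": "M", "major": "Law", "maker": "Dell"},
    "20:30:40:1188": {"gender": "F", "major": "CS", "maker": "Dell"},
}


def apply_filter(users, **criteria):
    """Anonymized MACs whose attributes satisfy every criterion."""
    return {mac for mac, attrs in users.items()
            if all(attrs.get(key) == value for key, value in criteria.items())}


def filter_sizes(users, keys):
    """Group size for every attribute combination that selects at least one user."""
    values = [sorted({attrs[key] for attrs in users.values()}) for key in keys]
    sizes = {}
    for combo in product(*values):
        size = len(apply_filter(users, **dict(zip(keys, combo))))
        if size:
            sizes[combo] = size
    return sizes


sizes = filter_sizes(users, ["gender", "major", "maker"])
single = sum(1 for size in sizes.values() if size == 1)
print(f"{single} of {len(sizes)} filters isolate a single user")
```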

Figure 3-2. Percentage of no. of users found, when 111 filters based on gender+major+manufacturer are applied

3.5 Analysis and Mitigation

The attacks described in the previous section are feasible because the attacker can identify unique WLAN usage in the traces. The attacker can identify the MAC address of his own machine by creating usage patterns that are unique within the trace collection environment. Patterns form because MAC addresses are consistently anonymized. Therefore, considering all the sessions made by a device (identified by MAC address), one can identify individual usage sequences from fields in the trace

such as location, start time and duration. For example, a user who starts using the WLAN every day around 9 am creates a pattern with respect to start time. This pattern may not be unique, as several users may start using the WLAN around 9 am. However, one can reduce the search space, or even make the pattern unique, by combining location and duration patterns with start time. Consider employees working in the same office space with the same office hours and workload. They would have similar start time, location and duration patterns. However, if the office and the residences share a common WLAN service (say, city-wide Wi-Fi, or students living on campus), the location, start time and duration of WLAN use at the residences would differ across users (unless every employee has the same residence and follows a similar lifestyle!). The argument here is that users can have sufficiently unique usage that it can be used to identify them even though the traces are anonymized. In the next two sub-sections we present our reasoning in support of this argument, through a theoretical analysis and a practical analysis on real WLAN traces.


Figure 3-3. UL at n = 5 (axes: number of access points, a; percentage of APs a user visits, pu; number of unique usage patterns, logarithmic scale)

3.5.1 Theoretical Analysis

Mathematically, it can be shown that each field in the trace can create an enormous number of patterns. For the sake of simplicity, we consider only the patterns generated by location, because similar equations can be used for the other fields. Let $U_L$ be the number of unique usage patterns possible using the location field only.

$$U_L(a, p_u, n) = C^{a}_{a p_u} \cdot (a p_u)^{n}$$

where $a$ is the total number of access points/locations, $p_u$ is the percentage of the total access points/locations that a user visits, $n$ is the number of sessions and $C$ denotes the combination function. Fig. 3-3 shows the distribution of $U_L$. $U_L$ is the product of the number of ways in which $a p_u$ access points can be selected out of a total of $a$ access points ($C^{a}_{a p_u}$) and the number of ways in which $a p_u$ access points can be visited over $n$ sessions ($(a p_u)^{n}$).

As an example, consider a university campus with hundreds of buildings, say the University of Florida (UFL), which has over 500 wireless access points, so the location field can take 500 different values. It has been shown that users generally use fewer than 5% of the access points 90% of the time [17, 36]. Therefore, in our case ($a = 500$ APs), we assume each person uses only 5% ($= p_u$) of them ($a p_u = 25$). The total number of combinations of APs a person can choose from is $C^{500}_{25} \sim 10^{42}$. Because a pattern depends not only on which locations are visited but also on the order in which they are visited, and assuming that the traces contain only 5 sessions per user ($n = 5$), the total number of paths possible for a user over 25 APs is $25^5 = 9765625 \sim 10^{7}$. Therefore, the total number of unique location patterns possible, $U_L$, is $\sim 10^{42} \times 10^{7} = 10^{49}$. The total number of students at UFL is $\sim 5 \times 10^{4}$, so the theoretical number of unique location patterns per user is $\sim 2 \times 10^{44}$. Even though this is a very loose upper bound and in reality the number can be smaller, it shows the enormous number of possible unique patterns that can be generated using just one field (location). This implies that, in theory, every user can have a unique pattern within a short time, which can be used to identify him. It further implies that sanitization techniques cannot work well if only the fields are anonymized; one should aim to anonymize the patterns. One way to do so is inconsistent MAC anonymization, which is extremely detrimental to the utility of the traces, the very reason traces are shared. A fundamental question about the relationship between utility and anonymization/privacy is evident here, which we plan to discuss in our future work.

Table 3-3. Result of finding users with similar location visiting sequences with varying duration of the trace

Period             5 Nov 2007   5 to 11 Nov 2007   5 to 18 Nov 2007   Nov 2007   Aug to Dec 2007   Aug 2007 to Jul 2008
Total users        9844         17602              22333              27068      47766             52217

100% match score
  users            4288         4847               4969               4461       4288              4880
  > 1 session      1477         1872               2061               1928       1840              2186
  > 5 sessions     31           121                108                131        187               235

90% match score
  users            4291         4494               5300               4879       4743              5486
  > 1 session      1480         2018               2391               2345       2294              2791
  > 5 sessions     34           268                439                548        642               839

80% match score
  users            4473         6068               6924               6872       7484              8954
  > 1 session      1662         3092               4015               4339       5036              6260
  > 5 sessions     113          1085               1777               2272       3057              3930
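The arithmetic of the worked example in Sec. 3.5.1 (a = 500, pu = 5%, n = 5, about 5 × 10^4 students) can be reproduced with exact integer arithmetic. This is a small sketch, assuming Python 3.8 or later for `math.comb`; the function names are illustrative and not taken from the study.

```python
import math


def unique_location_patterns(a, p_u, n):
    """U_L(a, p_u, n) = C(a, a*p_u) * (a*p_u)**n, computed exactly."""
    k = round(a * p_u)                 # number of APs a single user visits
    return math.comb(a, k) * k ** n


def order_of_magnitude(x):
    """Exponent of the largest power of ten not exceeding a positive integer."""
    return len(str(x)) - 1


a, p_u, n, students = 500, 0.05, 5, 5 * 10**4
u_l = unique_location_patterns(a, p_u, n)

print("C(500, 25)  ~ 10^%d" % order_of_magnitude(math.comb(500, 25)))
print("25^5        =", 25 ** 5)
print("U_L         ~ 10^%d" % order_of_magnitude(u_l))
print("per student ~ 10^%d" % order_of_magnitude(u_l // students))
```

Working with Python integers avoids floating-point overflow and rounding, so the orders of magnitude can be read off directly from the number of digits.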

3.5.2 Practical/Trace Analysis

To check the validity of the theoretical limits discussed above, we performed an experiment on one year of WLAN traces from UFL. Tab. 3-3 shows the findings for users having the same location visiting sequence. We compare users on the location field using the Longest Common Subsequence algorithm [26]. We find the number of users having a location visiting pattern similar to that of at least one other user, considering several time periods (1 day to 1 year) and listing the total number of WLAN users in each period. Tab. 3-3 also shows how many of these users had more than 1 and more than 5 sessions. The results support the insight behind the theoretical limits. We notice that, for a period of one year, only 4880 of ∼52K users (9%) had the same location visiting sequence as one or more other users, if we consider a 100% match. This means that almost 91% of users have a distinct location visiting sequence, and an attacker following a user can later identify him/her in the traces with probability greater than 0.9. Another result that further supports this statement is that only 235 users (0.45%) who share a location visiting sequence with other users have more than 5 sessions (for a 100% match score). This further strengthens the theoretical limit discussed earlier (interestingly, we found that most of these users had logged in to the same access point throughout their multiple sessions).

Figure 3-4. Results of the combination generation and sequence matching for 230 randomly chosen users out of the 27K users appearing in Nov 2007. The graph shows Pi and the number of sessions ni for each user.
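The text does not spell out how the match score behind Tab. 3-3 is computed from the Longest Common Subsequence. The sketch below therefore makes an explicit assumption: the score is the LCS length expressed as a percentage of the longer of the two location sequences. The LCS routine itself is the standard dynamic-programming algorithm.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def match_score(x, y):
    """Assumed definition: LCS length as a percentage of the longer sequence."""
    longest = max(len(x), len(y))
    return 100.0 * lcs_length(x, y) / longest if longest else 100.0


user_a = ["AcadBldg10AP1", "Library5AP1", "AcadBldg2AP2", "AcadBldg10AP3"]
user_b = ["AcadBldg10AP1", "AcadBldg2AP2", "AcadBldg10AP3"]
print(match_score(user_a, user_a), match_score(user_a, user_b))
```

Under this assumed definition, two users fall into the 100% row of the table only when their location sequences are identical.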

We also attempt to identify the source of these sequences, which become unique within a short time span. We note that not only can each field be used to form unique sequences, but several fields may also be combined to form unique sequences. We generate various sequences using several combinations of the location field for a user, maintaining the temporal ordering in the combinations. This helps us to determine how much information an attacker may obtain about a user even if the attacker follows him for only a few sessions. In that case, the attacker would have information holes in the observed sequence for the user. For example, he may be able to observe only the 2nd, 3rd, 6th, 8th and 10th sessions of a user. We investigated 230 randomly selected users from the set of 27K users appearing in the Nov 2007 WLAN traces from UFL. For each user, we created all possible combinations of sequences of length 5 using the location field, maintaining temporal order (earlier we saw that users with more than 5 sessions have a higher chance of being unique). Each combination represents a possible sequence that an attacker may capture by following a user, assuming the attacker may not be able to capture all the user's sessions. This simulates loss while capturing user information. We then search for these sequences in the traces belonging to all 27K users. Let $P_i$ be the percentage of matches for user $i$, defined as $P_i = M_i / C^{n_i}_5$. Here $C^{n_i}_5$ is the total number of length-5 combinations of sequences possible for user $i$, $n_i$ is the number of sessions of user $i$, and $M_i$ is the number of matches found for the $C^{n_i}_5$ sequences in the traces belonging to the 27K users. Fig. 3-4 shows the results of this experiment. Of the 230 users, 78 had fewer than 5 sessions in the whole month and were discarded. For the remaining users we plot $P_i$ in descending order along with $n_i$. One interesting result is that even when the total number of combinations generated is very high ($n_i = 100$, $C^{n_i}_5 = 75287520 \approx 7.5 \times 10^{7}$), the number of matches is very low ($M_i = 81$). This indicates that if the location information of 5 sessions is available in temporal order, even with many sessions missing in between, there is a very high chance of identifying the user in the trace.
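The combination-and-match experiment can be sketched as below. Two points are assumptions rather than statements from the text: a generated sequence is counted as a match when it occurs, in order and with gaps allowed, in the trace of at least one other user, and the result is scaled to a percentage. The brute-force enumeration is meant for small inputs only, since a user with 100 sessions already yields about 7.5 × 10^7 combinations.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, with gaps allowed."""
    remaining = iter(sequence)
    return all(any(p == s for s in remaining) for p in pattern)


def match_percentage(target, others, length=5):
    """Share (in percent) of the target's ordered length-`length` location
    combinations that also occur in some other user's trace."""
    total = matched = 0
    for observed in combinations(target, length):   # keeps temporal order
        total += 1
        if any(is_subsequence(observed, other) for other in others):
            matched += 1
    return 100.0 * matched / total if total else None


target = ["A", "B", "C", "A", "D", "B"]             # six sessions, so C(6, 5) = 6
others = [["A", "B", "C", "A", "D"], ["X", "Y"]]
print(match_percentage(target, others))
```

`itertools.combinations` emits elements in input order, so every generated tuple respects the temporal ordering of the sessions.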

According to our analysis, there are two ways of mitigating the attacks discussed in the previous section. One is to manipulate the traces in such a manner that no one can identify unique patterns, and the other is to prevent the linking of usage patterns to users. These two abstract ideas can be applied to the traces independently of each other. If no usage pattern is unique, an attacker who observes a pattern cannot assign it to a specific user and can never be sure of having identified the correct user or the correct pattern of that user. On the other hand, if we can prevent the linking of usage patterns to users, then no matter how many unique usage patterns one identifies, one cannot link them back to a user. Each method should individually provide sufficient privacy for the users. For the first method, many techniques exist in the literature, such as k-anonymity [76] and l-diversity [57]. For the second method, we need to devise techniques that obscure the linking information.
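As a rough illustration of the first idea, one can check whether a released trace is k-anonymous with respect to location sequences, that is, whether every sequence is shared by at least k pseudonyms. This is a simplified reading of k-anonymity applied to one field, not a technique evaluated in this work; the function names and sample data are hypothetical.

```python
from collections import Counter


def smallest_group(sequences):
    """Size of the smallest set of pseudonyms sharing one location sequence."""
    counts = Counter(tuple(seq) for seq in sequences.values())
    return min(counts.values()) if counts else 0


def violates_k(sequences, k):
    """Pseudonyms whose location sequence is shared by fewer than k users."""
    counts = Counter(tuple(seq) for seq in sequences.values())
    return sorted(mac for mac, seq in sequences.items() if counts[tuple(seq)] < k)


sequences = {
    "00:11:22:0353": ["AcadBldg10AP1", "Library5AP1"],
    "11:22:33:0521": ["AcadBldg10AP1", "Library5AP1"],
    "01:02:03:9877": ["AcadBldg2AP2"],
}
print(smallest_group(sequences), violates_k(sequences, k=2))
```

A trace provider could use such a check to decide which records need further generalization or suppression before release.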

3.6 Conclusions and Future Work

We have uncovered a serious problem in the way WLAN traces are anonymized. We believe this kind of attack is possible because WLAN traces have human behavior patterns embedded in them, which can easily be observed by an attacker following the victim. The aim of any privacy-protecting technique should be to ensure that even if an attacker has access to all the publicly available information about a user or a group of users (but not the mapping between anonymized and real MAC addresses), he cannot reduce the set of candidate users below a certain size, say K. This K should be a parameter configurable by the trace-releasing authority.

In the future, we plan to study the feasibility of anonymization using techniques such as perturbation and the release of traces in multiple formats, for example one with no location or time information. We would also like to investigate in further detail how fields such as start time, duration and location are responsible for generating unique patterns. This may be due to atomic properties of these fields, such as periodicity and history. We would like to work on a system that can generate anonymized traces according to the security clearance of the requesting user; this would allow us to serve traces with varying anonymization and privacy criteria and would make the traces more usable. We also plan to investigate whether the k-anonymity model [76] can be applied to WLAN traces.

The findings in this work call for new research in the area of WLAN trace anonymization and privacy, the details of which are to be pursued in our future work.


CHAPTER 4

AN ENCOUNTER-BASED FRAMEWORK FOR TRUST

The success of future mobile applications hinges on their wide adoption and acceptance by mobile users through increased interaction and cooperation. These factors become particularly crucial for emerging classes of mobile networks that include peer-to-peer networking, such as mobile ad hoc networks (MANETs), sensor networks and delay tolerant networks (DTNs). This study introduces and investigates a new mobile application that aims to improve interaction and cooperation by leveraging social connections and gaining confidence and trust in new opportunistic encounters.

The establishment of trustworthy networking is of prime importance, since most

interactions rely on trust establishment. This challenging problem is further exacerbated

by the uncertainty and dynamics in mobile networks. Furthermore, in MANETs and

DTNs cooperation and trustworthy networking are imperative to the construction and

operation of the network, without which these networks would fail.

Several factors pose a great challenge to the practical and effective study and

establishment of trust and confidence. First, conventional reputation and credit-based

systems rely on prior interaction to score trust. However, in the absence of such prior

interactions (due to introduction of new technology or psychological barrier), such

systems are not effective. We refer to this problem as the trust bootstrap problem, and

its solution is essential for jumpstarting trustworthy operation. Second, the utility of the

trust system is difficult to validate against the ground truth. Trust is a social trait; it is

subjective and contextual. Only through deployment and testing can the efficacy of such

a system be evaluated. Third, attacks to gain unwarranted trust are harder to detect due

to mobility, resource-constrained devices or lack of infrastructure. A secure trust system

should be stable and resilient against attacks.

At the same time, several unique characteristics of mobile networks provide

new opportunities to tackle the above challenges. The use of short range radios


(e.g., Bluetooth, Wi-Fi) enables detection and utilization of proximity and encounters.

Encounters represent an interesting primitive that can be used to construct abstractions

for reasoning probabilistically about trust, and for establishing encounter-based

keys [24, 55] that can seed future secure communications. In addition, the increased

capabilities of mobile devices, in terms of computation, storage, communication and

sensing, can add important contextual information to encounters, such as locations,

events, and statistical history. The processing of such information could augment the

user’s network view and awareness to score trustworthiness of other nodes and to

establish encounter keys or challenges (through out-of-band face-to-face exchanges).

Furthermore, the tight coupling between users and mobile devices enables new and

accurate ways to establish behavioral profiles that can be used to fine-tune the trust

processing; e.g., by adding more weight to trusted locations. It is the fusion and

integration of these multi-dimensional data that provide the promise of establishing

trustworthy opportunistic networking in ways we could not before, and in ways that are

not possible in wired networks, where connectivity does not imply proximity.

This study introduces a systematic framework and new protocol for gathering and

processing the above information to gain confidence and trust.¹ Our protocol is fully

distributed, self-bootstrapping, and integrates attack resilience mechanisms. The core

of our method utilizes a trust adviser algorithm that employs a set of parameterized

trust filters. The trust filters analyze mobile encounters, proximity, location, and context

data in novel ways, to augment the user’s network view and awareness. Its goal is to

identify opportunities of trust (or attack prevention) based on weighted filter scores that

are coupled with the user’s input and encounter keys to build a trustworthy node list.

¹ We shall use the term ‘trust’ to indicate confidence and opportunities to exchange encounter keys in mobile networks.


Focus is given to the investigation of the relationship (or lack thereof) between behavior

similarity (i.e., network homophily) and trustworthy mobile networking.

Effective establishment of trustworthy networked mobile communities can enable

several potential applications; including mobile social networking, formation of interest

communities and support groups (in health care, education), localized response and

emergency notification, context aware and similarity-based networking [8, 39], and worm

vaccination [77].

Our protocol’s mechanistic design and implementation strive to achieve the following

main design goals: stability, scalability, efficiency, distributed operation, and resilience. In

addition, careful thought is given to utility, accuracy and simplicity of the application.

Evaluation of the proposed trust adviser filters and app is a three-phase process: (i)

statistical analysis of real-world mobile network traces, (ii) extensive trace-driven simulation

of the framework components, and (iii) prototype implementation and participatory testing

on smartphones. First, we use wireless network traces from 3 different major university

campuses spanning 9 months with over 70K users and 150 million encounters. We find

that several filters possess desirable stability characteristics, and that trust scores in

general form a small world. Resilience to attacks (using anomaly detection) achieves

less than 10% false positives and 7% false negatives. Second, we measure the

effectiveness of ConnectEnc on epidemic routing in DTN with selfishness using the

new trust routing engine, and obtain stable trust routing without the sacrifice of network

performance. Third, we conduct a series of surveys and participatory experiments

to evaluate the performance of ConnectEnc against the ground truth. We find users’

willingness to trust others in a mobile network has a statistically strong correlation with

their behavioral similarity. Further, ConnectEnc filters can capture 80% of the already

known users within the top 25% of the encountered users.

Key contributions of this work include: 1. introducing a framework to augment

a mobile user’s perception and awareness of the network neighborhood by fusing


multi-dimensional encounter and contextual data, 2. analyzing various trust adviser

filters with extensive network traces, 3. proposing a model for anomaly or attacker

detection, 4. developing a mobile app ‘ConnectEnc’ that integrates the filters and

contextual information to aid user trust classification, and 5. deploying ConnectEnc as a

proof of concept and evaluating the system based on ground truth via participatory

testing.

4.1 Related Work

Several researchers have proposed novel approaches to establish trust and

cooperation in ad hoc and DTNs using credit and reputation based schemes, incentive

based schemes, and game theory.

The reputation based schemes target better peer selection based on previous

interaction and transfer records by assigning trust and cooperation ratings to nodes in a mobile

ad hoc network. In [20], a node detects misbehavior locally by observation and use

of second-hand information. In [19], a fully distributed reputation system is proposed

that can cope with false information, where each node maintains a reputation rating

and a trust rating for other nodes. In [14, 29, 70], analysis of reward provision and

punishment is conducted based on game theoretic approaches to provide incentives

for message delivery. In [13], the authors derive performance and optimization statistics to

measure the success in delivery probability for a message, covering both cooperative

and non-cooperative scenarios. The study in [67] analyzes the effect of cooperation on

three different routing algorithms. The authors investigate the performance of epidemic,

two-hop relaying and binary spray and wait routing to model a node’s cooperation

probability to either drop or forward a message. The incentive based credit schemes

rank trust for neighboring nodes. In [22], authors propose a game-theoretic model to

discourage selfish behavior and stimulate cooperation by leveraging Nash equilibria with

socially optimal behavior. In [84], authors propose a pricing mechanism to give credits


to nodes that participate in the message forwarding mechanism. The cooperation is

developed based on the number of messages transferred by the users.

A common theme in these works is the reliance on device interaction to evolve

the trust scores. Inherently, this creates an undesirable circular dependence, where

interaction requires technology adoption (say of ad hoc networks or DTNs), which,

in turn, requires trust. Hence, there is a compelling need for a bootstrap mechanism

for trust, which we directly address in our design. Furthermore, other studies do not

utilize encounter context, which is a focus of this work. Our work contributes

towards solving this challenge by providing inputs from the user’s location preferences

and contextual (e.g., social) behavior. It then uses the trust established using iTrust

to establish further trustworthy communication in various types of mobile networks,

including, but not limited to, ad hoc networks and DTNs.

Message delivery mechanisms in ad hoc, sensor and delay tolerant networks

necessarily require node cooperation. However, in reality, due to selfishness or lack of

trust, some nodes may not cooperate. Lack of cooperation, in such cases, may largely

disconnect or partition the network. Such selfish nodes (or free riders) [61] could exploit

network services but refuse to forward messages. An analytical model that builds the

concept of trust is discussed in [42, 58]. The authors show trust supports cooperation

and is heavily based on the interactions and bonds that govern behavior in ad hoc

and opportunistic scenarios. Other approaches discussed in [24, 55, 59] propose

explicit authentication mechanisms to generate trust and cooperation in the network. These

approaches are better modeled for small groups [55] and require exchange of public

keys and the installation of the private key on the user’s device [24]. We shall borrow

from these works for the establishment of opportunistic encounter keys in our trust

framework.

A few studies [53, 56] have attempted to use encounter information to route in

DTNs. These protocols seem to contribute towards improved prediction and routing in


DTNs. However, the relation between encounters and trust has not been investigated

in terms of the ground truth. The focus of this work is to establish such a relationship

between trust and encounter statistics, stability, location, and context through thorough

systematic analysis as well as surveys and experimentation. Ours is the first work that we are

aware of to contribute to this area of research.

4.2 Architectural Overview

In this section, we describe the design goals and major components of iTrust and

their functionality. We begin with design goals, then present a high level diagram of

the design in this section. We then proceed to describe all the modules in the following

sections.

4.2.1 Design Goals

The main design goals for the iTrust protocol include:

1. Accuracy - The recommendations should be as close to the user’s perception as possible. We achieve this by utilizing state-of-the-art trust advisers and adapting recommendations based on the user’s usage of the protocol.

2. Robustness - The trust recommendation should be stable over time and insensitive to minor, temporary changes and noise in user behavior. Outliers and anomalies should be detected and removed.

3. Energy Efficiency - Mobile devices are energy constrained. iTrust should strive to minimize its use of device resources in terms of computation, storage and communications.

4. Distributed Operation - iTrust should be able to provide all the functionalities in a distributed fashion without the need for a centralized infrastructure or trusted third party.

5. Privacy Preservation - The usage of the protocol should not affect the privacy of the user. All operations should be performed locally on the user’s device. Information about the user, if any, should be sent out of the device only on the user’s command.

6. Resilience - The system should function properly in the face of intrusion attacks and selfishness. We propose an anomaly detection technique to avoid intrusion attacks. Selfishness, or lack of cooperation, is investigated and analyzed especially in the context of ad hoc and delay tolerant networking.


Figure 4-1. Block diagram overview of the iTrust architecture. Dotted lines indicate modules needed by iTrust. Shaded blocks indicate modules discussed in this work.

Other goals include: the ability of the protocol to augment and integrate with other

reputation and credit based trust systems, the capability to bootstrap trust (without

requiring device cooperation), and flexibility to utilize other user preferences and

information (through external sources and social networks) in the future.

4.2.2 Overall Design

Fig. 4-1 provides an architectural overview of the iTrust framework and its

interconnections with related subsystems. The main components of the iTrust engine

(shaded blocks) include: a. trust adviser filters, b. trust recommendation generator,

c. weight generator, and d. anomaly detector. Other modules (inside the dotted line)

needed by iTrust include: a. radio scanner, and b. locator.

The ‘Trust Adviser Filter’ is the block that generates trust scores using a family

of filters (described in the next section). The different trust lists (produced by different

filters) are fed into the ‘Trust Recommendation Generation’ module. This block combines

all the trust filter results with the input from anomaly detection, recommendation system,


reputation system, and black and white lists using the weights generated by the ‘Weight

Generator’. The ‘Weight Generator’ uses built-in weight scores and adapts itself using

the selections made by the user. The ‘Anomaly Detection’ provides a recommendation

regarding suspicious encounter activities. It can also take the user’s input if needed. The

‘Short Range Radio Scanning’ module provides basic encounter information. Similarly,

the ‘Location Information’ module provides the device’s positioning data to ‘Trust Adviser

Filters’. Other modules such as ‘Reputation’ and ‘Recommendation’ provide extra

functionality and would be based on already existing techniques.

With this conceptual understanding of the system, we now describe each of the

modules shown in Fig. 4-1.

4.3 Trust Adviser Filters

The trust adviser filters constitute the heart of iTrust. Their function is to provide

meaningful, stable scores of trust for encountered devices.

The primary motivation of our work is to: a. encourage interaction in mobile

societies and adoption of new mobile services (e.g., mobile social networks), and b. establish

network connectivity in the context of ad hoc networks and DTNs. Trust can inspire

cooperation in networks, particularly in infrastructure-less networks. Here, trust means

that a user: 1. is willing to interact through the network with trusted nodes, and 2. in

DTNs, is ready to accept a message for the trusted user and genuinely attempt to route

it. To develop trust between a pair of users, we leverage proximity of mobile users (when

the devices come within radio range) and encounter information, location and context.

Several properties of nodal encounter behavior have been investigated in [40].

Our primary reasons for choosing encounters and proximity as measures to

generate trust are inspired by the work on homophily [60]. The principle of homophily

suggests a strong correlation between similarity of interest and frequency of meeting

and interaction. Trusting frequently encountered users would mean trusting similar

people (e.g. work colleagues or classmates). This trust can have social incentives too.


Second, when users are within the radio range of each other (for Bluetooth it is ∼15m),

they can potentially exchange out-of-band information including identity information and

cryptographic keys [24]. Such proximity-based out-of-band information exchange is

not possible in wired networks (inherently relational graphs, as the two terminals may

be geographically far apart) but can be utilized in mobile networks (inherently spatial

graphs).

The challenge is to find methods that can successfully discover potential similarities

between the users. We refer to these methods as Trust Adviser Filters. In the implementation,

a user would decide on which users to trust and the filters would serve as an adviser.

Thus, users would have full control over the selection of trusted users. These filters

would act as the scoring system that recommends users who are most similar to the

user. We have classified the filters into two major categories (Aggregation and Behavior

based) based on the similarity they measure. A third category of filters (Hybrid Filter)

combines results from the two main groups of filters to produce a trust score.

4.3.1 Aggregation Based Similarity

These filters aggregate the encounter data using statistical methods and provide a

measure of encounter-based similarity. We present two such filters based on frequency

and duration of encounters.

4.3.1.1 Frequency of Encounters (FE)

One of the basic filters to estimate similarity between a user pair is the number of

times they encounter each other. (An encounter is defined as the event where a device is

in radio range of another device, allowing device discovery.) This filter assumes that the more

often devices meet, the more similar (and the more trustworthy) their users are. On this assumption

(which may not always be valid), we design the FE filter, which counts the frequency of

the user’s encounters with all other users. To get the trust list from the FE filter for

a user, we sort all the encountered users by their number of encounters and select the top

users based on the trust (T) value.
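As a rough illustration of the FE filter, the following Python sketch counts encounters per user and keeps the top-ranked users. The function name `fe_trust_list`, the input format (one user ID per logged encounter), and the reading of T as the fraction of encountered users to keep are assumptions made for this example; they are not part of the iTrust specification.

```python
import math
from collections import Counter


def fe_trust_list(encounters, top_fraction):
    """Rank encountered users by encounter count and keep the top fraction.

    encounters: iterable of user IDs, one entry per logged encounter (assumed format).
    top_fraction: assumed interpretation of the trust value T, in (0, 1].
    """
    counts = Counter(encounters)  # user ID -> number of encounters
    # Sort by descending count; break ties by user ID for a deterministic order.
    ranked = [user for user, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    keep = math.ceil(len(ranked) * top_fraction)
    return ranked[:keep]


log = ["bob", "carol", "bob", "dave", "bob", "carol"]
print(fe_trust_list(log, 0.4))  # ['bob', 'carol']
```

A single pass over the encounter log is enough to build the counts, which matches the O(m) processing cost listed for FE in Table 4-1.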


4.3.1.2 Duration of Encounters (DE)

The percentage of time spent by a user with another user is another measure

of similarity. The more time the users spend together, the more similar (and

trustworthy) they are likely to be. On this basis, we design the DE filter to keep track

of the duration of time spent by a user with all the other users. From the ordered list of

encounter durations for the user, the DE filter selects the top trusted users based on the T

value.
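The DE filter can be sketched in the same spirit. The session format (user ID, start time, end time in seconds) and the normalization by the user's total online time are assumptions of this example; overlapping sessions with the same user are not merged here.

```python
from collections import defaultdict


def de_scores(sessions, total_online_seconds):
    """Return, per encountered user, the fraction of online time spent together.

    sessions: iterable of (user_id, start, end) tuples in seconds (assumed format).
    total_online_seconds: the device owner's total online time (assumed to be > 0).
    """
    totals = defaultdict(float)
    for user, start, end in sessions:
        totals[user] += max(0.0, end - start)  # ignore malformed negative durations
    return {user: t / total_online_seconds for user, t in totals.items()}


sessions = [("bob", 0, 600), ("carol", 100, 400), ("bob", 1000, 1300)]
scores = de_scores(sessions, 3600)
print(scores["bob"])  # 0.25
```

Sorting the resulting dictionary by value gives the ordered list from which the top users are selected.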

4.3.2 Behavior Based Similarity

Behavior based similarity measures similarity based on location visitations and

preferences. We couple location information with encounters to determine the similarity

between users.

4.3.2.1 Profile Vector (PV)

To capture behavioral characteristics, we have designed the PV filter, which stores the location

visitations of a user in a one-dimensional vector. It is assumed for this filter that a

device has some localization capability, which is quite common for today’s devices. Each

device maintains a vector. The columns of the vectors represent the different locations

visited by a user and the values stored in each cell indicate either duration or count of

the sessions at that particular location. At each location visit, the vector is updated with

respect to the location.

To get a similarity score, this vector is exchanged with another user and the inner

product of the two vectors is computed. This similarity score is higher if the two PVs are

similar and is zero if the users do not have any visited location in common. Here,

implicit weight is given to locations based on the count/duration spent. We can also

provide an option for the user to give weights to the locations explicitly.
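A minimal sketch of the PV similarity follows. It stores each profile vector sparsely as a dictionary from location name to visit count (or duration); this representation and the name `pv_similarity` are assumptions for illustration only.

```python
def pv_similarity(pv_a, pv_b):
    """Inner product of two sparse profile vectors (location -> count or duration)."""
    # Only locations present in both vectors contribute a non-zero term.
    return sum(value * pv_b[loc] for loc, value in pv_a.items() if loc in pv_b)


alice = {"library": 5, "gym": 2}
bob = {"library": 3, "cafe": 4}
carol = {"cafe": 1}

print(pv_similarity(alice, bob))    # 15 (only "library" is shared)
print(pv_similarity(alice, carol))  # 0 (no common location)
```

Explicit user-supplied location weights could be added by multiplying each term with a per-location factor.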

However, this filter is not privacy preserving and can introduce attacks in the

system, since a user can tamper with its vector; there are also communication costs


Figure 4-2. Location Vector (LV) for a user

involved in exchanging the vectors. These problems are solved by the LV filter at the cost of having

less information to compute similarity scores with.

4.3.2.2 Location Vector (LV)

The LV filter is very similar to PV, except that a user maintains a vector not only for itself

but also for each of the encountered users. The columns of the vectors represent the

different locations visited by a user and the values stored in each cell indicate either

duration (LV-D) or count (LV-C) of the sessions at that particular location. For every

encounter, the vector for the encountering node is updated with respect to the encounter

location (see Fig. 4-2).

Since vectors for all the encountering users are maintained locally on the device,

LV requires no exchange of vectors among users for calculating similarity. This is more

privacy-preserving and more resilient to attacks since only first-hand information is

used (equivalent to what user might have observed). This privacy comes at the cost of

requiring extra storage space for storing vectors for each user. Considerable storage

optimization is achieved by storing (for each encountering user) only the locations where

encounters happened. Similarity is calculated as in PV.
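The local bookkeeping required by LV can be sketched as follows. The class `LVStore` and its method names are hypothetical; the sketch assumes the count-based variant (LV-C) by default and reuses the inner product from PV between the owner's own vector and the locally observed vector of a peer.

```python
from collections import defaultdict


class LVStore:
    """Keeps the owner's location vector and one sparse vector per encountered peer."""

    def __init__(self):
        self.own = defaultdict(float)                          # location -> own count/duration
        self.peers = defaultdict(lambda: defaultdict(float))   # peer -> location -> value

    def record_visit(self, location, amount=1.0):
        """Update the owner's vector for a visit (amount = 1 for LV-C, seconds for LV-D)."""
        self.own[location] += amount

    def record_encounter(self, peer, location, amount=1.0):
        """Update the peer's vector; only locations where encounters happened are stored."""
        self.peers[peer][location] += amount

    def similarity(self, peer):
        """Inner product between the owner's vector and the observed vector of a peer."""
        vec = self.peers.get(peer, {})
        return sum(self.own.get(loc, 0.0) * value for loc, value in vec.items())


store = LVStore()
store.record_visit("library", 4)
store.record_visit("gym", 1)
store.record_encounter("bob", "library")
store.record_encounter("bob", "library")
store.record_encounter("bob", "cafe")
print(store.similarity("bob"))  # 8.0
```

Because every value comes from first-hand observation, no vector ever leaves the device, at the price of O(nl) storage in the worst case.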

4.3.2.3 Behavior Matrix (BM)

The behavior matrix captures a spatio-temporal representation of user behavior.

Each column of the behavior matrix denotes a location and each row represents a time unit (here

the time unit is taken as a day for simplicity). The value stored at each cell is the fraction

of the on-line time spent by the user at a particular location on a particular day (see


Fig. 4-3). Each user maintains their own matrix. To get the similarity score, users can

exchange and compare the two matrices.

To make the behavior similarity check efficient (in terms of space and computation

complexity) and privacy preserving (as only a summary of the matrix is exchanged), we

use the eigenvalues of the behavior matrix for exchange between the two users. The

eigenvalues are generated using SVD (Singular Value Decomposition). SVD is applied

to a behavior matrix M, such that:

$$M = U \cdot \Sigma \cdot V^{T}, \tag{4--1}$$

where a set of eigen-behavior vectors, $v_1, v_2, \ldots, v_{\mathrm{rank}(M)}$, that summarize the important

trends in the original matrix $M$ can be obtained from matrix $V$, with their corresponding

weights, $w_{v_1}, w_{v_2}, \ldots, w_{v_{\mathrm{rank}(M)}}$, calculated from the eigenvalues in the matrix $\Sigma$. This set of

vectors is referred to as the behavioral profile of the particular user, denoted as $BP(M)$,

as they summarize the important trends in the behavioral pattern of the user with matrix $M$. The behavioral

similarity metric between two users’ association matrices $A$ and $B$ is defined based on

their behavioral profiles, vectors $a_i$ and $b_j$ and the corresponding weights, as follows:

$$\mathrm{Sim}(BP(A), BP(B)) = \sum_{i=1}^{\mathrm{rank}(A)} \sum_{j=1}^{\mathrm{rank}(B)} w_{a_i} w_{b_j} \, |a_i \cdot b_j| \tag{4--2}$$

which is essentially the weighted cosine inner product between the two sets of

eigen-behavior vectors.
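The similarity metric of Eq. 4–2 can be sketched in plain Python once the behavioral profiles are available. The sketch assumes that each profile has already been obtained from the SVD and is given as a list of (weight, vector) pairs, and that both users index locations in the same order; the function name `bp_similarity` is hypothetical.

```python
def bp_similarity(profile_a, profile_b):
    """Weighted sum of absolute inner products between two behavioral profiles.

    Each profile is a list of (weight, vector) pairs; vectors are equal-length
    lists of floats over the same ordered set of locations (assumed).
    """
    total = 0.0
    for w_a, a in profile_a:
        for w_b, b in profile_b:
            dot = sum(x * y for x, y in zip(a, b))
            total += w_a * w_b * abs(dot)
    return total


profile_a = [(0.8, [1.0, 0.0]), (0.2, [0.0, 1.0])]
profile_b = [(1.0, [0.0, 1.0])]
print(bp_similarity(profile_a, profile_b))  # 0.2
```

Only the dominant eigen-behavior vectors and their weights need to be exchanged, which is what makes this comparison cheaper than exchanging the full matrices.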

4.3.3 Hybrid Filter (HF)

Each filter provides a different perspective on an encounter or behavioral aspect.

The hybrid filter provides a systematic and flexible mechanism to combine the scores

from all filters and present a unified score to the users. The selection of weights for

various filters would depend on several factors including user’s preference and feedback

(see Sec. 4.6.1) and application requirements. A generic Hybrid Filter score (H) for a


Figure 4-3. Behavior Matrix for a user

user $U_j$ can be generated by using the following:

$$H(U_j) = \sum_{i=1}^{n} \alpha_i F_i(U_j) \tag{4--3}$$

where $F_i(U_j)$ is the normalized score for user $U_j$ according to filter $i$, $\alpha_i$ is the weight given to filter score $F_i$, and $n$ is the total number of filters used. We select $\alpha_i$ such that $\sum_i \alpha_i = 1$ and $0 \le \alpha_i \le 1$.

Note that our design (Fig. 4-1) provides feedback to the system based on user

selections. This feedback can be used to make the weights adaptive.
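The hybrid score is a convex combination of the normalized filter scores. The following sketch assumes that scores and weights are passed as dictionaries keyed by filter name; the function name and the example weights are illustrative and not taken from the ConnectEnc implementation.

```python
def hybrid_score(filter_scores, weights, tol=1e-9):
    """Combine normalized filter scores for one user with weights that sum to 1."""
    if set(filter_scores) != set(weights):
        raise ValueError("scores and weights must cover the same filters")
    if any(w < 0 or w > 1 for w in weights.values()):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights.values()) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * filter_scores[name] for name in weights)


scores = {"FE": 0.9, "DE": 0.5, "LV": 0.2}    # normalized scores for one user
weights = {"FE": 0.5, "DE": 0.3, "LV": 0.2}   # hypothetical weights
h = hybrid_score(scores, weights)
```

Adapting the weights from user feedback would then amount to updating the `weights` dictionary and renormalizing it so that the constraints still hold.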

Decay of filter scores: Social science studies have shown that social relationships

are dynamic and require frequent interactions to prevent decay. The strength of a

relationship wanes as the time between interactions increases. This decay follows

an exponential pattern with a half-life dependent on the relationship type [21] (3.5

years for family, 6 months for colleagues). Configurable decay was integrated in our

ConnectEnc app with the default half-life set to 6 months.
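Exponential decay with a configurable half-life can be written in one line. The sketch below assumes that decay is applied to a score as a function of the time since the last interaction and that 6 months is approximated as 182.5 days; how ConnectEnc applies the decay internally is not specified here.

```python
def decayed_score(score, days_since_last_interaction, half_life_days=182.5):
    """Halve the score for every half-life that has elapsed without interaction."""
    return score * 0.5 ** (days_since_last_interaction / half_life_days)


print(decayed_score(1.0, 182.5))  # 0.5 after one half-life
print(decayed_score(8.0, 365.0))  # 2.0 after two half-lives
```

A family relationship could use `half_life_days=3.5 * 365` under the same formula.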


Table 4-1. Overhead of filters in terms of processing and storage. Here m is the total no. of records in the encounter file, n is the no. of unique encountered users, l is the no. of locations visited, and d is the no. of days used for BM calculations. We also assume that m >> n.

Filter   Processing Overhead   Storage Overhead
FE       O(m)                  O(n)
DE       O(m)                  O(n)
PV       O(m)                  O(l)
LV       O(m)                  O(nl)
BM       O(m)                  O(ld^2) for SVD
HF       O(n)                  O(n)

4.4 Anomaly Detection

Incorporating resilience to attacks is a primary requirement for our design. Here, the

attack on the trust system includes an attempt by an untrusted user (e.g. a stranger)

to gain trust of the system in a relatively short time by injecting many encounter

events (e.g. via stalking). A growth of trust scores in this fashion can be considered

an anomaly, and a specialized anomaly detection system is needed to combat such

attacks. Since iTrust scores individual encountered nodes, at present we consider single

attacker scenarios.

An attacker would want to get onto the trusted list as soon as possible to achieve a

large effect with limited effort. The goal of the trust system design is then to

considerably raise the level of effort needed for a successful attack, to be no less than that of

genuinely trusted nodes and friends, which may entail weeks of consistent encounters

at trusted locations by the attacker. The spatio-temporal granularity used in our adviser

filters determines such attack effort and provides us with the anomaly we aim to detect.

Note that in our implementation, a user can opt to approve or remove any node

before it is added to the trusted list. The role of anomaly detection is then to

raise a red flag when an attack is suspected.

Our anomaly detection approach investigates the evolution of encounter patterns

and trust over time, and does not require information exchange between nodes.


Normal operation is observed where regular users are encountered over time. The

anomaly detection mechanism considers the slope of the growth of encounter statistics

(including frequency, duration or behavioral similarity as defined by the trust adviser

filters). The detection system learns normal behavior over time, and incorporates

deviations from the normal to detect suspect nodes and trigger user alerts. Admittedly,

this approach is promising only when the user’s own behavior is regular. In situations

where encounter patterns fluctuate considerably (e.g., during irregular events, trips or

city change), a re-evaluation of this approach is warranted (part of future work).

4.4.1 Detection Model

For attacker detection, we integrate scores from various filters with location

information as available. For example, using FE , the slope and standard deviation

of growth of trust score per user can be used to identify outliers, marking them as

attackers. If we consider the LV filter, attackers can be identified by comparing the

differences in scores based on locations (users encountered at more locations than

others).

Here, we use the FE filter as an example to design the anomaly detection system. For a

user, we define a function, $FE(i, T)$, that yields the FE score for encountered user $i$ after

time $T$.

Since we are calculating the slope of trust score growth over time, the time interval can

be defined in two ways, and therefore the slope is defined in two ways. The time interval

can be either the total number of days since the first encounter or the

number of days on which an encounter happened. Both definitions are necessary to

ensure that an attacker who waits for a long duration after an initial encounter with the

user, and then has multiple encounters in a short time, does not go undetected because of a

slow growth slope. The two slopes are called $\gamma_1$ and $\gamma_2$:

$$\gamma_{1i}(T) = \frac{FE(i, T)}{C_1(i, T)} \tag{4--4}$$


$$\gamma_{2i}(T) = \frac{FE(i, T)}{C_2(i, T)} \tag{4--5}$$

where the function $C_2(i, T)$ gives the number of days since the first encounter with

user $i$ and $C_1(i, T)$ gives the number of days on which an encounter happened with

user $i$.

To distinguish an attacker from other users, we propose to select the neighbors of user $i$ in

terms of number of encounters with the user, creating a set $S_{i,T}$. Here, $S_{i,T}$ is the set of all

users $k$ ($0 < k < n$ and $k \neq i$) who encountered the user and also satisfy

$$|FE(i, T) - FE(k, T)| \le x, \tag{4--6}$$

where $x$ is a parameter set by the user. The input $x$ determines the number of users

(size of the neighborhood) to consider when identifying an attacker (anomaly). The users in $S_{i,T}$

are similar to suspect user $i$, as they have similar FE scores. From the users in $S_{i,T}$,

we can determine the mean and standard deviation of the slope of the neighbors. The mean

($\mu_1$) and standard deviation ($\sigma_1$) can be calculated as shown below ($\mu_2$ and $\sigma_2$ can be

obtained similarly):

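$$\mu_1 = \frac{\sum_{u \in S_{i,T}} \gamma_{1u}(T)}{|S_{i,T}|} \tag{4--7}$$

$$\sigma_1 = \sqrt{\frac{\sum_{u \in S_{i,T}} \left(\gamma_{1u}(T) - \mu_1\right)^2}{|S_{i,T}|}} \tag{4--8}$$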

A user $i$ is classified as an attacker if the slope $\gamma_1$ is greater than $\mu_1 +

\kappa \sigma_1$ or the slope $\gamma_2$ is greater than $\mu_2 + \kappa \sigma_2$. Here, $\kappa$ is a multiplying factor ranging from 1 to 3 that we

need to investigate with the traces to discover the optimal performance of our detection.

This kind of detection is referred to in the academic literature as nearest-neighbor-based

anomaly detection [23]. In essence, we consider all the encountered users who have


a score similar to that of the user being evaluated (neighborhood size controlled by $x$), and if the slope

(considering both measures) of the user differs from that of the neighbors, the user is suspected

to be an attacker and flagged for evaluation.
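The detection rule of Eqs. 4–4 to 4–8 can be sketched as follows. The dictionaries `fe`, `c1`, and `c2` (FE score, number of days with an encounter, and number of days since the first encounter, each keyed by encountered user) and the function name `is_suspect` are assumptions of this example; day counts are assumed to be at least 1, and the numbers in the usage example are made up.

```python
from statistics import mean, pstdev


def is_suspect(user, fe, c1, c2, x, kappa):
    """Flag `user` if either growth slope exceeds mean + kappa * std of its neighbors."""
    # Neighbors: other encountered users whose FE score is within x of the suspect's.
    neighbors = [k for k in fe if k != user and abs(fe[user] - fe[k]) <= x]
    if not neighbors:
        return False  # no reference group to compare against
    for days in (c1, c2):  # gamma_1 uses C1, gamma_2 uses C2
        slopes = [fe[k] / days[k] for k in neighbors]
        # pstdev divides by the neighborhood size, as in Eq. 4-8.
        threshold = mean(slopes) + kappa * pstdev(slopes)
        if fe[user] / days[user] > threshold:
            return True
    return False


fe = {"a": 20, "b": 22, "c": 18, "m": 21}   # encounter counts after time T
c1 = {"a": 20, "b": 20, "c": 18, "m": 3}    # days on which an encounter happened
c2 = {"a": 28, "b": 30, "c": 25, "m": 3}    # days since the first encounter
print(is_suspect("m", fe, c1, c2, x=5, kappa=2))  # True: 21 encounters in 3 days
print(is_suspect("a", fe, c1, c2, x=5, kappa=2))  # False
```

Everything needed by this check is already stored locally by the FE filter, so no information exchange between nodes is required.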

4.4.2 Attacker Model

Evaluating the anomaly detection system designed in Sec. 4.4 is challenging, as we

do not know how to model attackers. We assume that before this service is available for

general use, this kind of attack would not happen. Thus, the traces we have will not contain

any patterns belonging to the attacker we have discussed here. This makes detection

and validation difficult. To deal with this challenge, we present here an attacker model

created so as to beat the anomaly detection we designed earlier (it is just one of the

possible models for attackers).

We have created a parameterized model for the attacker, based on the number of

encounters, the maximum days available (Max days), and the periodicity of encounters. Number of encounters is

the number of encounters an attacker will have. In the simulation this number is kept

close to the minimum number of encounters needed to overcome the trust threshold.

Max days provides the length of the period in which the attacker can have encounters. Periodicity

of encounters provides the pattern of encounter information. The attacker follows a

periodic encounter pattern, as it has been shown by studies (cite sungwook globecom)

that users show periodic encounter behavior (such as a weekly pattern). However, the

period may vary from user to user. The attacker would like to follow the pattern displayed

by other encounters so as to reduce suspicion (even though, in reality, the attacker may

only guess and not accurately obtain the periodicity information). In our work here, we have

considered a time granularity of days, i.e., we consider cumulative encounters on a per-day

basis. It is also possible to use a second-, minute-, hour-, or week-based time granularity.

The effects of changing the time granularity are not discussed in this work and are left as

future work.


Using the periodicity information, we can identify the days during which the attacker
has encounters (restricted by Max days). We then distribute the sessions equally over
the encounter days. In our simulations we vary Max days from 1 to 30 (the trace
we consider for anomaly detection is 30 days long). For each value of Max days, we
compute the attacker pattern (AP). This AP is then injected back into the traces, and the
anomaly detector is run on the entire trace set to detect it. Algorithm 1 describes the
model used for the attacker.

Input: time period allowed for attack (MaxDay), average days (AvgDay), Number of Encounter (NumEnct)
Output: Attacker Pattern (AP[])

for i ← 0 to MaxDay do
    AP[i] ← 0
end
EncDay ← NumEnct / (AvgDay ≤ MaxDay ? AvgDay : MaxDay) ;
period ← ceil(MaxDay / (AvgDay ≤ MaxDay ? AvgDay : MaxDay) - 0.5) ;
left ← 0 ;
for i ← 0 to MaxDay, Steps = period do
    if AvgDay == 0 then
        Break ;
    end
    AP[i] ← EncDay ;
    left ← left + EncDay ;
    AvgDay ← AvgDay - 1 ;
end
left ← NumEnct - left ;
j ← 0 ;
while left != 0 do
    AP[j] ← AP[j] + 1 ;
    left ← left - 1 ;
    j ← j + period ;
    if j ≥ MaxDay then
        j ← 0 ;
    end
end
for j ← 1 to MaxDay do
    AP[j] ← AP[j] + AP[j-1] ;
end

Algorithm 1: Algorithm of Attacker model for Anomaly detection
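A runnable reading of Algorithm 1 is sketched below in Python. It is an interpretation rather than the original implementation: the function name `attacker_pattern` is hypothetical, the division for EncDay is assumed to be integer division (otherwise the leftover loop would not terminate), the array is assumed to hold exactly MaxDay entries, and the final cumulative sum is assumed to run once after the leftover encounters have been distributed.

```python
import math


def attacker_pattern(max_day, avg_day, num_enct):
    """Cumulative encounter count per day for one synthetic attacker.

    max_day:  number of days available to the attacker (MaxDay)
    avg_day:  number of days on which encounters should occur (AvgDay)
    num_enct: total number of encounters to generate (NumEnct)
    """
    if max_day <= 0 or avg_day <= 0 or num_enct < 0:
        raise ValueError("max_day and avg_day must be positive")
    ap = [0] * max_day
    active_days = min(avg_day, max_day)
    enc_day = num_enct // active_days              # assumed integer division
    period = math.ceil(max_day / active_days - 0.5)

    # Place an equal share of encounters on every period-th day.
    placed = 0
    days_left = active_days
    for i in range(0, max_day, period):
        if days_left == 0:
            break
        ap[i] = enc_day
        placed += enc_day
        days_left -= 1

    # Spread the remainder one encounter at a time over the same days.
    left = num_enct - placed
    j = 0
    while left > 0:
        ap[j] += 1
        left -= 1
        j += period
        if j >= max_day:
            j = 0

    # Convert per-day counts into a cumulative score curve.
    for j in range(1, max_day):
        ap[j] += ap[j - 1]
    return ap


pattern = attacker_pattern(max_day=30, avg_day=4, num_enct=22)
```

With these inputs the encounters fall on days 0, 7, 14 and 21, which mirrors the weekly periodicity discussed in the text; the last entry of the pattern equals the requested number of encounters.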

Figure 4-4. The growth of trust score using FE filter for a specific user. Each line corresponds to an encountered user. (Axes: Time in Days vs. Score of FE filter.)

Figure 4-5. The growth of trust score using FE filter using the attacker model. Each line corresponds to an instance of attacker generated by the model. (Axes: Time in Days vs. Score of FE filter.)

Fig. 4-4 shows a sample trust score growth using the FE filter. We can observe how
the trust score for encountered users progresses for a user. We notice that around days 13,
17, 23, and 25 most of the score curves are slanted, and it is on these days that the score increases.
The score change (or encounters) does not happen every day; instead, it happens on
certain days and is broadly periodic. This periodic property is captured in our attacker
model too. Fig. 4-5 shows multiple attacker encounter patterns targeted at
a specific user. The curves are similar to those in Fig. 4-4, but denser
(because the number of patterns here is much larger than in Fig. 4-4), and score building
starts early. (This should not make much difference, as in the slope calculation γ2 we only
consider exact days of encounter, so the total length of the period will not have any effect.)

4.5 Trace Based Evaluation and Analysis

In this section, we evaluate the design of iTrust filters including anomaly detection

and analyze the effects of recommendations on DTN routing with selfish nodes. Since

much of the following analysis uses WLAN traces, we begin by describing the traces

used and then proceed to the evaluation.

4.5.1 Traces

To evaluate our design, we consider anonymized trace sets from three universities
(see Tab. 4-2 for more details; the information provided in the traces is anonymized).
Tab. 3.1 shows a sample trace used in this work. The advantage of using WLAN
traces is that they are much closer to reality in terms of user mobility than the existing
synthetic mobility models. However, these traces, much like other real traces, contain a small
percentage of noise and error. We assume that users associated with the same wireless
access point encounter each other, as the range of an access point is generally less
than 50 meters in an indoor environment and most of the traces are from indoor usage.
Also, since only a few users may change/modify the MAC address of their devices, we
assume that a MAC address uniquely identifies a device and is always associated with a
single user. There could be a few users who share devices.


Table 4-2. Facts about studied traces

Trace Source            U1                 USC [41]           Dartmouth [7]
Time/duration of trace  Fall 2007          Spring 2007        Fall 2005
Start/End time          09/01/07-11/30/07  01/01/07-03/30/07  09/01/05-11/30/05
Unique Locations        845 APs            137 buildings      133 APs
Unique MACs analyzed    34694              32084              4906

Figure 4-6. Similarity score for various filters for all the encountered pairs of users in Nov 2007 from the U1 trace. Panels: A. Frequency of Encounter (FE), Users vs. Number of Encounters; B. Duration of Encounter (DE), Users vs. Average Encounter Duration (in sec); C. Location Vector (LV-D), Similarity Score vs. User Pairs; D. Behavior Matrix (BM), Similarity Score vs. User Pairs.


Figure 4-7. Correlation between the trusted lists produced by various filters at T=40%. Panels: A. U1, B. Dartmouth, C. USC. Axes: Length of History in Weeks vs. Similarity Percentage; series: DE vs FE, DE vs LV-C, DE vs LV-D, FE vs LV-C, FE vs LV-D, LV-C vs LV-D, DE vs BM, FE vs BM, LV-C vs BM, LV-D vs BM.

4.5.2 Filter Evaluations

Using the traces, four properties of the filters are investigated: 1. Ability of filters

to distinguish between different encounters (statistical characterization), 2. Correlation

among filter results, 3. Stability over time, and 4. Small world characteristics. Then the

results from anomaly detection are discussed.

To generate the trust scores from the various filters, the WLAN trace is converted to an
encounter trace for each user by determining and storing all the other users who had
overlapping sessions with this user at the same access points (location). Filters take the
encounter trace as input and produce a ranked list based on the similarity measure
used by that filter. For analysis, we pick the top T% of users from these ranked lists.

Figure 4-8. Comparison of trust lists belonging to different history lengths for various filters at T=40% (note that the y-axis scale for DE, FE, and LV-C starts at 85% and for LV-D and BM the scale starts at 35%). Panels: A. Duration of Encounter (DE), B. Frequency of Encounter (FE), C. Location Vector - Count (LV-C), D. Location Vector - Duration (LV-D), E. Behavior Matrix (BM). Axes: Length of history in weeks vs. Similarity Percentage; one series per history length (1 Week to 9 Week).
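The conversion from WLAN sessions to per-user encounter traces, followed by the top-T% selection, can be sketched as follows. All names here are hypothetical, the session tuple layout (user, access point, start, end) is an assumption about the trace format, and the FE and DE scores are simple stand-ins (encounter count and total overlap time) that may differ from the exact filter definitions given earlier in this chapter.

```python
import math
from collections import defaultdict
from itertools import combinations


def encounters_from_sessions(sessions):
    """Map each user to (other_user, ap, start, end) overlap records."""
    by_ap = defaultdict(list)
    for user, ap, start, end in sessions:
        by_ap[ap].append((user, start, end))
    encounters = defaultdict(list)
    for ap, items in by_ap.items():
        # Two users encounter each other if their sessions at one AP overlap.
        for (u, s1, e1), (v, s2, e2) in combinations(items, 2):
            begin, finish = max(s1, s2), min(e1, e2)
            if u != v and begin < finish:
                encounters[u].append((v, ap, begin, finish))
                encounters[v].append((u, ap, begin, finish))
    return encounters


def fe_de_scores(user_encounters):
    """Stand-in FE (count) and DE (total overlap time) scores per peer."""
    fe, de = defaultdict(int), defaultdict(float)
    for other, _ap, begin, finish in user_encounters:
        fe[other] += 1
        de[other] += finish - begin
    return dict(fe), dict(de)


def top_t_percent(scores, t):
    """Return the top t percent of peers, ranked by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:math.ceil(len(ranked) * t / 100)]


sessions = [
    ("a", "ap1", 0, 10), ("b", "ap1", 5, 20), ("c", "ap1", 8, 9),
    ("b", "ap2", 30, 40), ("a", "ap2", 35, 50),
]
enc = encounters_from_sessions(sessions)
fe_a, de_a = fe_de_scores(enc["a"])
trusted_a = top_t_percent(fe_a, 50)
```

The pairwise comparison per access point is quadratic in the number of sessions at that access point; for traces of the size used here, a sweep over sessions sorted by start time would likely be preferable.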

4.5.2.1 Statistical Characterization

The proposed filters are justified if the scores generated for encountered users allow
us to tell them apart. In this section, we consider a one-month-long WLAN trace from U1
(other traces have similar characteristics) and present the distribution of trust scores for
all filters, for all the encountered pairs (see Fig. 4-6; LV-C's characteristics are similar to
those of LV-D).

We notice that for the FE filter, 3,000 users have over 1,000 encounters each in
a month and more than 15,000 users (over 2/3 of the population) have over 100
encounters. Similarly, for the DE filter, the average encounter duration for more than 20,000
users is over 1,000 seconds. Results from the LV-D filter show that a large number of user
pairs have a low score (close to zero), which may mean that most users are not
similar to each other; only a few user pairs have a high similarity score.
The BM filter score, like LV, is close to zero for most user pairs and high for a few.
These results justify our choice of filters for distinguishing encountered users.

4.5.2.2 Correlation

We examine the degree of similarity (correlation) among trust lists from different
filters. While high similarity indicates redundancy of the filters, low similarity implies
orthogonality of the trust recommendations. For this investigation, we have considered
9-week-long traces and created trust lists at T = 40% for varying lengths (at 1-week intervals)
of encounter history (results for other T values show a similar trend).

As Fig. 4-7 shows, the trends are similar across the traces. The LV-D and LV-C filter
results show ∼70% similarity as the lists stabilize around 9 weeks of history. FE vs. DE
stabilizes around 60% to 70%. The rest of the filter pairs stabilize between 30% and 55%, meaning
they produce different trust lists. The low similarity indicates that the filters are not
redundant and can be used to generate a rich set of recommendations.
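A trust-list comparison of this kind can be sketched as a set overlap. The exact similarity measure is not restated in this section, so the definition below is an assumption: the percentage of the shorter list that also appears in the other list. The function name `list_similarity` is hypothetical.

```python
def list_similarity(list_a, list_b):
    """Percentage of the shorter trust list that also appears in the other.

    Assumed overlap measure; returns 0.0 if either list is empty.
    """
    a, b = set(list_a), set(list_b)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


# Hypothetical trust lists produced by two filters for the same user.
de_list = ["u1", "u2", "u3", "u4"]
fe_list = ["u2", "u3", "u4", "u5", "u6", "u7"]
similarity = list_similarity(de_list, fe_list)
```

The same function can be applied to lists generated by one filter from different history lengths, which corresponds to the stability comparison in the next subsection.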


4.5.2.3 Stability

Fluctuations in trust recommendations over time could confuse users. Therefore, it
is imperative to examine the stability of the trust recommendations over time. We investigate
the stability of trust lists at T = 40% using 9 weeks of U1 traces (other T values and
traces show a similar trend). Trust lists generated from traces of different lengths are
compared to examine stability.

More than 90% similarity is found between the 1-week and 9-week traces for the DE, FE and
LV-C filters (see Fig. 4-8), implying that users selected in the 1st week of encounters
continue to be in the trust list of the 9-week-long encounter history. The BM filter shows
high stability when the difference in history is less than 2 weeks (∼80%), falling to
55% between 1 week and 9 weeks. The LV-D filter shows a similarity of about 40% between
any two lists, implying that every week the list changes by 60%. This indicates that users
may encounter regularly (as shown by the stability of LV-C) but may spend different amounts of time
encountering over the weeks. Overall, we note that some filters (DE, FE, and LV-C)
stabilize in just 1 week of history, which makes them suitable for recommendations when
the trust history is short. The time interval between trust list regenerations can also be
long (reducing processing requirements). As the stability of the LV-D filter is comparatively
low, we may need to regenerate its trust list weekly.

4.5.2.4 Graph Analysis

We analyzed the effect of trust on the network graph and compared it with
regular and random graphs while increasing trust (T) (using the DE filter; other filters show
similar results). An edge is added between a pair of nodes when at least one of
them trusts the other (undirected graph). We note that the clustering coefficient (CC) [11]
of the network increases with T% and the path length (PL) decreases with increasing
T%. For example, using the 9-week U1 trace, CC is 0.171 at T = 10% and becomes 0.201 at
T = 100%, while in the same scenario the path length decreases from 3.64 to 2.59.
More than 99% of the nodes were connected even at T = 10%.
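The graph construction and the two measures can be illustrated with a short, dependency-free sketch. The function names are hypothetical; the clustering coefficient is taken as the average local clustering coefficient and the path length as the mean hop count over connected node pairs, which are common definitions but are assumed here rather than taken from [11]. Node identifiers are assumed to be mutually comparable (for example, all strings).

```python
from collections import deque


def build_trust_graph(trust_lists):
    """Undirected adjacency: an edge exists if either node trusts the other."""
    adj = {}
    for u, trusted in trust_lists.items():
        adj.setdefault(u, set())
        for v in trusted:
            if v != u:
                adj.setdefault(v, set())
                adj[u].add(v)
                adj[v].add(u)
    return adj


def clustering_coefficient(adj):
    """Average local clustering coefficient (degree < 2 contributes 0)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def average_path_length(adj):
    """Mean BFS hop count over all connected ordered pairs of nodes."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical trust lists: "d" trusts "c" although "c" does not trust "d".
graph = build_trust_graph({"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c"]})
cc = clustering_coefficient(graph)
pl = average_path_length(graph)
```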


Figure 4-9. Normalized Clustering Coefficient and Normalized Path Length. Panels: A. UF, B. Dartmouth, C. USC. Axes: Trust Percentage vs. Normalized Clustering Coefficient & Path Length; series: NCC and NPL for each history length from Week 1 to Week 10.

A small world analysis is performed as described in [11]. We find that the normalized
CC (NCC) is close to the CC of a regular graph and the normalized PL (NPL) is close to the PL
of a random graph (Fig. 4-9 shows NCC and NPL for different lengths of traces and
values of T). The network created by the trust lists thus appears to be a small world network.

4.5.2.5 Anomaly Detection

Here we analyze the effectiveness of the anomaly detection system we proposed.

Evaluating the anomaly detection system designed in Sec. 4.4 is challenging. Since

iTrust service is still not available, no attack patterns or models exist. Therefore, to

evaluate our anomaly detection system, we have created an attacker model (it is just

one of the possible models for attackers).

To mimic users’ encounter patterns that are periodic where the period is determined

by individual user behavior, we kept attacker’s encounters periodic. This period is

obtained from the victim’s encounter pattern (so attacker can avoid obvious suspicion).


The number of encounters needed to get into a victim's trust list is also known. The
only tunable parameter is the number of days in which the attacker wants to achieve the
required encounter score (the detailed algorithm is given in [1]). To evaluate, we varied the
number of days from 1 to 30 (the trace is from U1 and is 30 days long). Forty users were
analyzed (20 users with the maximum number of encounters and 20 with an average number
of encounters in the 30-day trace).

To validate our model and detection scheme, we choose false positives and false
negatives as metrics. The percentage of regular users identified as attackers counts
as false positives, whereas the percentage of attackers that the detector fails to identify counts as false
negatives. As Tab. 4-3 shows, the percentage of false positives generally decreases as we
increase the size of the set Si,T, and the percentage of false negatives increases as we increase
the κ factor for the standard deviation (FE filter scores were used). We notice that the best
detection occurs at κ = 1 and neighborhood size (or |Si,T|) = 10. These results are
promising, yet warrant further analysis to optimize and create a better attacker model
and detection system (outside the scope of this study). However, the results show that
iTrust can work with anomaly detection and can flag suspected users.

Table 4-3. False positives and negatives while using the proposed anomaly detection (in percentage)

κ   |Si,T|   False +ve   False -ve
1   5        10.03       8.30
1   10       8.15        6.27
1   15       9.73        6.44
2   5        3.80        20.11
2   10       2.77        19.97
2   15       2.16        19.38
3   5        3.18        48.27
3   10       1.12        44.24
3   15       0.98        42.04
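The two metrics reported in Table 4-3 follow directly from the definitions in the text. The sketch below uses a hypothetical function name and assumes the sets of regular users, injected attackers and flagged users are known; the toy numbers are illustrative and are not taken from the experiments.

```python
def detection_error_rates(flagged, attackers, regular_users):
    """Return (false positive %, false negative %) for one detector run.

    False positives: regular users that were flagged as attackers.
    False negatives: injected attackers that were not flagged.
    """
    flagged, attackers, regular_users = set(flagged), set(attackers), set(regular_users)
    if not attackers or not regular_users:
        raise ValueError("need at least one attacker and one regular user")
    false_pos = 100.0 * len(flagged & regular_users) / len(regular_users)
    false_neg = 100.0 * len(attackers - flagged) / len(attackers)
    return false_pos, false_neg


# Illustrative data only: 10 regular users, 4 injected attacker patterns.
regular = {f"r{k}" for k in range(10)}
injected = {"a0", "a1", "a2", "a3"}
fp, fn = detection_error_rates({"r1", "a0", "a1", "a2"}, injected, regular)
```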

4.5.3 Selfishness & Trust Routing in DTN

DTNs are one of the network scenarios where iTrust can work. DTNs are infrastructure-less
networks that rely on the cooperation of the nodes. Since nodes spend their
resources in routing messages, the nodes may only route messages for nodes they
know or when they have some incentive. In these scenarios (where nodes are selfish),
we find that using iTrust improves the network connectivity and routing performance.

To examine the effectiveness of iTrust, we introduce selfishness and use epidemic
routing [79] as a tool to study the performance of a routing protocol over the WLAN traces.
Selfishness is defined as the probability (S) that a node will not accept and route
packets for a node it does not trust. Epidemic routing performs a controlled flooding
and has been shown to provide the lower bound on the number of hops and the
time needed. Epidemic routing also provides the upper bound on reachability. These
properties make it an appropriate tool for the purpose of our evaluations.

Fig. 4-10 shows the flow chart for iTrust routing inside each node. When a node

receives a message from a trusted sender, it accepts the packet and attempts to route

it. Otherwise, the node accepts the packet based on factors such as user-configured

selfishness. For our purpose, we have considered the acceptance of packets from

untrusted node based on the selfishness probability (S). For the purpose of simulation,

nodes are trusted (as recommended by iTrust) based on the T values.
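The per-node acceptance decision described above can be written as a small function. This is a sketch under assumptions: the name `accepts_packet` is hypothetical, the trust list is assumed to be a set of node identifiers, and the selfishness probability S is applied independently to every packet from an untrusted sender.

```python
import random


def accepts_packet(trust_list, sender, selfishness, rng=random):
    """Decide whether a node accepts and routes a packet from sender.

    Packets from trusted senders are always accepted; packets from
    untrusted senders are rejected with probability `selfishness` (S).
    """
    if sender in trust_list:
        return True
    return rng.random() >= selfishness


# With S = 0.9, roughly one in ten packets from untrusted senders is accepted.
rng = random.Random(42)
accepted = sum(accepts_packet({"friend"}, "stranger", 0.9, rng) for _ in range(10000))
acceptance_rate = accepted / 10000
```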

The performance of epidemic routing is measured using three metrics: Unreachability,
Delay, and Overhead. We define Unreachability as the fraction of nodes, out of all
receivers, that could not be reached by a given source. Delay is defined as the ratio
of the average time taken by a message to reach all the possible receivers to the maximum
possible delay. Finally, Overhead is the average number of hops a message took to
reach all the possible receivers using the shortest path. Since overhead and delay were
seen to vary directly with unreachability, we have not shown the overhead and delay results
(they are available in [1]).
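Unreachability under epidemic routing with selfish nodes can be estimated by replaying time-ordered encounters. The following is a simplified sketch, not the simulator used for the results in this chapter: the function name and the (time, node, node) contact format are hypothetical, a contact is treated as instantaneous, trust is checked against the node handing over the message, and a rejected node may be offered the message again at a later contact.

```python
import random


def epidemic_unreachability(contacts, source, nodes, trust, selfishness, rng):
    """Fraction of receivers never reached by a message flooded from source.

    contacts: iterable of (time, u, v) encounters
    trust:    dict mapping a node to the set of nodes it trusts
    """
    reached = {source}
    for _time, u, v in sorted(contacts):
        for carrier, receiver in ((u, v), (v, u)):
            if carrier in reached and receiver not in reached:
                trusted = carrier in trust.get(receiver, ())
                if trusted or rng.random() >= selfishness:
                    reached.add(receiver)
    receivers = set(nodes) - {source}
    return len(receivers - reached) / len(receivers)


contacts = [(1, "a", "b"), (2, "b", "c"), (3, "c", "d")]
nodes = ["a", "b", "c", "d"]
rng = random.Random(0)
no_selfishness = epidemic_unreachability(contacts, "a", nodes, {}, 0.0, rng)
with_trust = epidemic_unreachability(
    contacts, "a", nodes, {"b": {"a"}, "c": {"b"}}, 1.0, rng
)
```

In the toy run, fully selfish nodes (S = 1) still forward along the two trusted links, so only node "d" stays unreachable.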

Fig. 4-11 shows the average unreachability for various combinations of trust and
selfishness using the DE filter (results from other filters show a similar trend). Using the first
60 days of traces, we create preliminary trust lists, after which we run epidemic routing
for the next 30 days. Trust lists are updated weekly during the run of epidemic
routing (to mimic a mobile device, as computing the trust list after every encounter or daily
would be resource intensive for the device). Around 800 nodes are randomly selected as
sources for the epidemic routing. During a round, only one node sends a message, and
we measure the unreachability of the message for that node. Each point on the graph
represents the average unreachability over 800 rounds (one for each sender).

Figure 4-10. Flow chart for iTrust routing

Intuitively, selfishness should cripple the connectivity in the network. Fig. 4-11
shows that the network unreachability increases as S increases (with T = 0). To the
benefit of our scheme, we find that as trust is introduced in the network, the effect of
selfishness is reduced. Here we use the trust list from the DE filter (other filters show a similar
trend). For U1, when T = 0% and S = 0.9, unreachability increases by 83% from the
case when S = 0. However, adding trust T = 40% (S = 0.8) increases unreachability
by only 31% from the case when S = 0. Likewise, for Dartmouth, when T = 0 and
S = 0.9, unreachability increases by 40% from the case when S = 0. However, adding
trust T = 40% (S = 0.9) increases unreachability by only 10% from the case when
S = 0. For USC, T = 0 and S = 0.9 increases unreachability by 1.7% from the case when
S = 0. However, adding trust T = 40% (S = 0.9) increases unreachability by only 0.48%
from the case when S = 0. The effect of trust is higher when selfishness is high, which
makes iTrust more suitable in networks with high selfishness. The effect of trust is not
significant in the USC traces, which could be a result of the high unreachability in the network
even at S = 0 (5 times that of U1 or Dartmouth). Also, adding selfishness does not increase
the unreachability significantly for USC.

We now compare the performance of the filters and a few
possible Hybrid Filters (Sec. 4.3.3). For this purpose, we use the U1 traces (as the
trends from other traces are similar) and vary the weights from 0 to 2 (see Fig. 4-12).
The highest unreachability is produced by using only the BM filter score and the lowest
by using the FE filter. The combination of filters at equal weights has unreachability
close to the FE filter and is better than either BM or FE. This analysis gives us two
important results. First, a combination of filter scores can produce better results
(and also avoids user confusion) than using individual filters; second, as its default
configuration iTrust can use equal weights for combining the filter scores.
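The weight notation used for the hybrid configurations (for instance, the ratio 1211 corresponding to weights 0.2, 0.4, 0.2 and 0.2 for DE, FE, LV-D and BM) can be turned into normalized weights as sketched below. The function names are hypothetical, and the weighted sum is an assumed form of the Hybrid Filter of Sec. 4.3.3; the per-filter scores are assumed to be normalized to a common range beforehand.

```python
FILTER_ORDER = ("DE", "FE", "LV-D", "BM")


def weights_from_ratio(ratio):
    """Convert a ratio string such as "1211" into normalized weights."""
    parts = [int(ch) for ch in ratio]
    total = sum(parts)
    if len(parts) != len(FILTER_ORDER) or total == 0:
        raise ValueError("ratio needs one digit per filter and a nonzero sum")
    return {name: part / total for name, part in zip(FILTER_ORDER, parts)}


def hybrid_score(scores, weights):
    """Weighted sum of per-filter scores (assumed already normalized)."""
    return sum(weights[name] * scores.get(name, 0.0) for name in FILTER_ORDER)


weights = weights_from_ratio("1211")
combined = hybrid_score({"DE": 1.0, "FE": 0.5, "LV-D": 0.0, "BM": 0.0}, weights)
```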

4.6 Survey and Implementation Based Validation

To validate the approach of iTrust with the ground truth, we have employed surveys

and user feedback from the iTrust application.

4.6.1 Survey

To investigate the trust needs of users and the importance they give to trust, we
conducted a survey at a major computer network conference. Although this is a
biased sample of survey takers, this population has a good understanding of computer
networks. We received 32 usable responses. Participants were asked to indicate their
willingness to communicate (using ad hoc networks or DTNs) under different scenarios on a scale
of 1 to 10.

As Fig. 4-13 shows, the willingness of users to cooperate with an unknown user
is low (mean is 2.31). However, willingness increases when users have knowledge
about the encounter history. This reinforces iTrust's approach of using encounters
to build trust in the network. We also observe that users give more importance to
combined scores (FE and DE scores are both high) than to individual scores (FE is high or DE
is high). This justifies iTrust's use of the Hybrid Filter for combining trust recommendations.

Figure 4-11. Average unreachability with varying Trust and Selfishness using DE filter. Panels: A. U1, B. Dartmouth, C. USC. Axes: Trust (T=0% to T=100%) vs. Unreachability; series: S=0.1 to S=0.9.

Figure 4-12. Hybrid filter results when T=40%. The numbers in the legend indicate the ratio of the scores from each filter. For example, 1211 implies αDE = 0.2, αFE = 0.4, αLV−D = 0.2, and αBM = 0.2, and 0100 implies αDE = 0, αFE = 1, αLV−D = 0, and αBM = 0 (Sec. 4.3.3). Axes: Selfishness (S) vs. Unreachability; series: 1111, 1112, 1121, 1211, 2111, 1000, 0100, 0010, 0001.

Figure 4-13. Survey results showing users' propensity to communicate with other users in various communication scenarios. Axes: Communication Scenarios (No Information; High FE, Low DE; High DE, Low FE; High FE AND DE) vs. Willingness to Communicate (0 to 10).

Standard deviations in results suggest that although most users want information about

encountered users before cooperating, the individual importance of the filters may vary.

This flexibility is made available in iTrust ’s Hybrid Filter by assigning weights according

to user’s preference.


Figure 4-14. Illustration of iTrust's components and their interactions

4.6.2 iTrust Application

To show the viability of iTrust and to validate our design with user studies, we have
implemented most of the core features of iTrust for mobile platforms. Currently, iTrust
is available for the Android platform and the Linux-based Nokia N810 tablet. It provides the
ability to rate encountered users based on the FE, DE, LV and Hybrid filters. Encountered
users can be sorted by any filter, and the weights for the Hybrid filter are user configurable.
If some of the encountered users are currently discoverable, their listing has a
green circular mark, as shown in Fig. 4-15A. The application provides built-in facilities
for scanning Bluetooth devices and wireless access points (for localization, as GPS is
expensive in terms of energy; the user can select GPS if needed).

On selecting a particular user, the encounter details (Fig. 4-15B) are presented, and
by clicking on the map option one can see the encounter locations on a map (Fig. 4-15C).
Encountered devices can be rated for trust by the user on a scale from -2 (no trust) to
2 (high trust). This allows users to store their evaluations of encountered devices, and the
ratings can also be used by other applications on the user's device.


The application block diagram is shown in Fig. 4-14. The arrows in the diagram
represent how the encounter data flows in the application. The basic blocks of iTrust
are Bluetooth and Wi-Fi scanning. Bluetooth scanning is used to discover and record
Bluetooth devices, and Wi-Fi scanning is used to obtain localization information. Traces
from both scanners are then parsed and given to the filters. Encounters are then
rated and ranked by the filters and, based on the weights for the hybrid filter, a combined
score is also generated and saved. The user can also choose to update locations, which
entails contacting a third-party server such as Google or Skyhook to get location data
based on the Wi-Fi AP data (users can also switch to the more power-hungry GPS for
localization). This allows users to visualize encounters on a map.

In the application, we have also added an optional discovery service that can
show more information about an encountered user, such as name, email, social profile
link and personal web page. This service allows users to weed out potentially
uninteresting/unsuitable encountered users before initiating contact and key exchanges.
Fig. 4-15D shows how a user can register the device and provide information
about him/herself so that other encountering users can find out more about this user. When
the privacy option is selected, the information is shared with another user only when approved by
this user. To look up information from this registry, users click on the name of
the encountered device on the screen showing the encounter details.

4.6.2.1 Application Evaluation:

We asked a group of 30 students (grad and undergrad) from CS major to run iTrust

app for a month. Users already owning android phone, ran iTrust on their phones, rest

were given Nokia N810 devices. Users were asked to mark devices they trust in the

application. Finally, out of the 30 students we received usable traces (at least one month

long) from 22 users. On average, number of trusted user marked by each user is 15 and

number of unique devices encountered per user is 175. We use this data to investigate

if behavioral similarity as captured by the trust filters is correlated to trusted user

100

Page 101: IDENTIFYING SOCIAL MARKERS FROM NETWORK DATA BASED …ufdcimages.uflib.ufl.edu/UF/E0/04/48/88/00001/KUMAR_U.pdf · identifying social markers from network data based on location,

A. B. C. D.

E. F. G. H.

Figure 4-15. Screenshots of iTrust application. Fig. A shows the main screen whereencounter users are sorted by the filter score. Current encounters markedwith Green circles. Trusted users are shown in Blue color. Fig. B showsdetails for an encountered user. Fig. C shows user encounters on Map.Fig. D shows the registration screen for optional users informationdiscovery service. Fig. E shows screen where display order of encounteredusers can modified. Fig. F shows the screen to select weights for theHybrid filter (in the app it is referred as combined filter). Fig G. Shows thescreen where user can check self statistics regarding encounters. It alsoshows the number of scans saved due the use of energy efficient scanner.Fig H. Shows the menu. Menu allows the user to jump from one screen toanther.

101

Page 102: IDENTIFYING SOCIAL MARKERS FROM NETWORK DATA BASED …ufdcimages.uflib.ufl.edu/UF/E0/04/48/88/00001/KUMAR_U.pdf · identifying social markers from network data based on location,

Figure 4-16. Continuation of screenshots of the iTrust application. Fig. A shows the settings screen. Fig. B shows the number of encounters the user had with a particular user over a period of time. This feature allows a user to know more about encountered users. Fig. C shows a graph from the Self-Stat screen of the application. Here the graph shows the total number of encounters this user had with respect to time. Fig. D shows the about page with author information and a web link for iTrust.

identification. We note that not all encountered users, whether trusted or not, may have been marked. Also, only discoverable Bluetooth devices are captured, so trusted users whose Bluetooth is not discoverable will not be shown. This issue will be of lesser concern as the adoption of iTrust increases.

We rated the recommendations of iTrust for each of the 5 filters (including the Hybrid Filter with equal weights) on 4 metrics: (1) the number of trusted users in rank ranges 1 to 10, 11 to 20, etc. (also known as the Precision metric in the Information Retrieval literature); (2) the percentage of total trusted users in ranks 1 to 10, 11 to 20, etc.; (3) the fraction of encountered users needed (from the top) to capture ‘x’% of trusted users for each filter; and (4) Normalized Discounted Cumulative Gain (NDCG) [43].

For metrics 1, 2 and 3, we considered users in descending order of the encounter score given by each filter. For metric 1, we then counted the number of trusted users (as marked by the user) among the top 1 to 10 users, the top 11 to 20 users, etc. Fig. 4-17A shows the


Figure 4-17. iTrust evaluations based on application usage. Fig. A shows the percentage of trusted users in the top 1 to 10 users, top 11 to 20 users, etc., for each filter. Fig. B shows the percentage of total trusted users in the top 1 to 10, 11 to 20, etc. Fig. C shows the fraction of encountered users needed (from the top) to capture ‘x’% of trusted users for each filter. Fig. D shows the Normalized Discounted Cumulative Gain score for iTrust recommendations.

[Plots omitted. Panels A and B: x-axis Rank Ranges (1 to 10 through 80 to End), y-axis Percentage of Trusted Users, series FE, DE, LVC, LVD, Hybrid. Panel C: x-axis Percentage of Trusted Users Included, y-axis Percentage of Encountered Users (Descending Filter Score). Panel D: x-axis Filters, y-axis NDCG.]

graph of this metric for all the filters. It shows that, on average, 5 (50%) or more of the top 10 users ranked by the FE, DE and Hybrid filters are marked trusted. The top 10 ranks of the LV filters contain 3 to 4 trusted users on average; however, if we consider the top 20 users, all filters capture 6 to 8 trusted users (more than 50% of the total trusted users). The number of trusted users in the remaining ranges continues to fall, except in the last range, which contains all the users ranked beyond 80. For all the filters, there is a strong, statistically significant correlation between the score and the rank of trusted users (e.g., for LVC, r=0.84, p<0.01). This shows that users' willingness to trust others in a mobile network is statistically correlated with their behavioral similarity as captured by iTrust.


Results for metric 2 are similar to those for metric 1, as shown in Fig. 4-17B. We note that, of all the users marked trusted, more than 50% are in ranks 1 to 10 (except for the LV filters), and almost 80% of the trusted users are captured in ranks 1 to 20.

Metric 3 measures the fraction of encountered users needed (from the top) to capture ‘x’% of trusted users for each filter. This metric shows that 80% of the trusted users are captured by the top 25% of the encountered users as ranked by the filters, and there is a strong, statistically significant correlation (Fig. 4-17C).
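The rank-based metrics above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the evaluation code used in the study: the class and method names (RankMetrics, trustedPerRange, fractionNeeded) are hypothetical, and the input is assumed to be a list of trusted/non-trusted flags already sorted by descending filter score.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the iTrust source.
final class RankMetrics {

    /**
     * Metric 1: number of trusted users in each rank range of size rangeSize.
     * trustedByRank.get(k) is true when the user at rank k + 1 (descending
     * filter score) was marked trusted. Ranks beyond the last full range are
     * pooled into one final "to End" range.
     */
    static int[] trustedPerRange(List<Boolean> trustedByRank, int rangeSize, int fullRanges) {
        int[] counts = new int[fullRanges + 1];
        for (int k = 0; k < trustedByRank.size(); k++) {
            if (trustedByRank.get(k)) {
                counts[Math.min(k / rangeSize, fullRanges)]++;
            }
        }
        return counts;
    }

    /**
     * Metric 3: fraction of the ranked list (from the top) needed to include
     * at least the given fraction of all trusted users. Returns 0.0 when no
     * trusted user has to be covered.
     */
    static double fractionNeeded(List<Boolean> trustedByRank, double target) {
        int total = 0;
        for (boolean t : trustedByRank) {
            if (t) total++;
        }
        int needed = (int) Math.ceil(target * total);
        if (needed <= 0) return 0.0;
        int seen = 0;
        for (int k = 0; k < trustedByRank.size(); k++) {
            if (trustedByRank.get(k)) seen++;
            if (seen >= needed) return (k + 1) / (double) trustedByRank.size();
        }
        return 1.0;
    }

    public static void main(String[] args) {
        // Toy ranking of 12 encountered users, 4 of them trusted.
        List<Boolean> ranked = Arrays.asList(true, true, false, true, false, false,
                false, false, false, false, true, false);
        System.out.println(Arrays.toString(trustedPerRange(ranked, 5, 2)));
        System.out.println(fractionNeeded(ranked, 0.5));
    }
}
```

Metric 2 follows from the same counts by dividing each range count by the total number of trusted users.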

Metric 4, which is based on the DCG measure, is used to measure the effectiveness of search engines by giving a higher score to search results that are more relevant. Normalized DCG (NDCG) is the ratio of DCG to IDCG (Ideal DCG). The IDCG can be calculated by finding the best possible search result (in our case, all the trusted users should be ranked first, followed by the non-trusted users). NDCG therefore tells us how far the current results are from the ideal. We note from Fig. 4-17D that iTrust is able to capture close to 70% of the IDCG via the FE, DE and Hybrid filters and close to 50% of the IDCG via the LV filters. This shows that our recommendations are relevant and close to the ideal case.
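As a rough illustration of the NDCG computation described above, the sketch below assumes binary relevance (trusted = 1, not trusted = 0) and the common logarithmic discount 1 / log2(rank + 1). The exact DCG variant used in [43] is not restated here, so this discount, along with the class name Ndcg, should be read as an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the iTrust source.
final class Ndcg {

    /** Discount for a 1-based rank: 1 / log2(rank + 1) (assumed variant). */
    private static double discount(int rank) {
        return 1.0 / (Math.log(rank + 1) / Math.log(2));
    }

    /** DCG with binary gain: each trusted user adds the discount of its rank. */
    static double dcg(List<Boolean> trustedByRank) {
        double sum = 0.0;
        for (int i = 0; i < trustedByRank.size(); i++) {
            if (trustedByRank.get(i)) {
                sum += discount(i + 1);
            }
        }
        return sum;
    }

    /** NDCG = DCG / IDCG, where the ideal ranking lists all trusted users first. */
    static double ndcg(List<Boolean> trustedByRank) {
        int trusted = 0;
        for (boolean t : trustedByRank) {
            if (t) trusted++;
        }
        if (trusted == 0) return 0.0; // nothing to rank; defined as 0 in this sketch
        double ideal = 0.0;
        for (int rank = 1; rank <= trusted; rank++) {
            ideal += discount(rank);
        }
        return dcg(trustedByRank) / ideal;
    }

    public static void main(String[] args) {
        System.out.println(ndcg(Arrays.asList(true, true, false)));  // ideal ordering
        System.out.println(ndcg(Arrays.asList(false, true, false))); // trusted user ranked second
    }
}
```

A ranking that places every trusted user ahead of every non-trusted user scores 1; pushing trusted users down the list lowers the score.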

We also note that some users have a high rank yet are not trusted. We believe these may be encountered users who are very similar to the user and can provide new interaction opportunities to the user.

Other observations from the deployment include that almost 70% of users preferred using equal weights for the Hybrid Filter (HF). The application used 6.2MB of storage on average, with the filter scores taking only 98KB; the rest was occupied by the encounter traces. This shows that the storage overhead of the iTrust filters is quite small compared to the raw traces. The raw traces can be removed from the device after processing to save space. At this rate (roughly 6.1MB of traces per month), about 75MB is needed to store the traces for a whole year.


4.6.2.2 Energy Efficiency

Scanning for Bluetooth and WiFi devices consumes the most power, since the scanning process is periodic. After receiving the traces (which were scanned at 1 min intervals), we noted that, due to spatial locality in the traces, we can skip scanning rounds if we find the same devices again in the next round, assuming that the user is in the same location with the same devices. Based on this assumption of spatial locality, we have designed and implemented energy efficient scanning algorithms for iTrust. More details are in Appendix B.

4.6.2.3 Location Estimation

To calculate meaningful Location Vectors, at least building-level granularity is needed (the granularity needed may also depend on the trust context). On a mobile device, location can be estimated using several techniques. Some standard techniques are: 1. GPS (does not work well indoors and has “warm up” delays), 2. WiFi signals (may not be very accurate and may not work everywhere), and 3. Cell Tower ID (cannot work with devices that do not have phone functionality). GPS may be the most accurate localization technology but needs the most energy, and the WiFi/Cell Tower ID methods need an online database lookup to get coordinates [50, 54]. In our application, we do not need location coordinates immediately (only when encounter locations are shown on the map), and since we have observed that users have spatial locality, scanning of WiFi signals and cell towers works. Once in a while the app can fetch the mapping of WiFi APs and cell towers to location coordinates, and because of the spatial locality in user movement patterns, we only need to fetch the mapping for locations that have not been visited previously. This scheme saves energy (by not using GPS every time) at the cost of communication (which can wait until the phone is fully charged and connected to a high-speed network) and accuracy.

When using WiFi/Cell ID for localization, we want to minimize communication costs (i.e., do as few coordinate/location lookups as possible). To reduce lookups, we cache


the WiFi AP sets whose coordinates we know (we get a set of APs every time we scan; the number of APs depends on the location). Upon getting a set of WiFi APs from a new scan, we go through the cached WiFi AP sets whose address we already know. A few challenges with this scheme are: 1. The APs scanned at the same location may change with time (an AP could have moved to another location or may not get scanned every time (scanning noise)). 2. Since every localization scheme using WiFi signals employs heuristics for location estimates (accuracy for Google database is ≥ 150m), different AP sets may give the same coordinates (collision in location space). To solve the first challenge, we look at the intersection of the two sets (one cached and the other recently scanned), and if the intersection is greater than a percentage (say 30%), the two sets are considered to be the same. For the second challenge we currently do not have a worked-out solution.

It is important to create a new location field only when the user visits a new location; otherwise the actual time spent at a location can get fragmented or fused, which may result in incorrect scores. We note that sometimes, when we have slightly overlapping AP sets (say only 10%), the address lookup for both of them may return the same coordinates. We plan to address these issues in the future.
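The AP-set matching step can be sketched as follows. This is a minimal illustration, not the iTrust implementation: the class ApSetCache and its methods are hypothetical, AP identifiers are assumed to be strings (e.g., BSSIDs), and because the text only says the intersection must exceed "a percentage", the sketch assumes the overlap is measured relative to the smaller of the two sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the iTrust source.
final class ApSetCache {
    private final List<Set<String>> knownSets = new ArrayList<>();
    private final double threshold; // e.g., 0.30 for the 30% mentioned in the text

    ApSetCache(double threshold) {
        this.threshold = threshold;
    }

    /** Overlap = |A intersect B| / min(|A|, |B|); the denominator is an assumption. */
    static double overlap(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size() / (double) Math.min(a.size(), b.size());
    }

    /**
     * Returns the index of the first cached AP set that matches the scan.
     * If none matches, the scan is cached as a new location and its new
     * index is returned (a real app would also queue a coordinate lookup).
     */
    int locate(Set<String> scanned) {
        for (int i = 0; i < knownSets.size(); i++) {
            if (overlap(knownSets.get(i), scanned) > threshold) return i;
        }
        knownSets.add(new HashSet<>(scanned));
        return knownSets.size() - 1;
    }

    int size() {
        return knownSets.size();
    }

    public static void main(String[] args) {
        ApSetCache cache = new ApSetCache(0.30);
        System.out.println(cache.locate(new HashSet<>(Arrays.asList("ap1", "ap2", "ap3"))));
        System.out.println(cache.locate(new HashSet<>(Arrays.asList("ap1", "ap2", "ap9"))));
        System.out.println(cache.locate(new HashSet<>(Arrays.asList("ap7", "ap8"))));
    }
}
```

Taking the first match keeps the lookup cheap but does nothing about the second challenge (distinct AP sets that resolve to the same coordinates), which the text leaves open.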

4.7 Discussion: Other Trust Inputs

As explained in section 3, iTrust was architected to potentially integrate with other

trust components and sub-systems, including blacklisting, other recommendation

systems, and contextual information.

4.7.1 Blacklist & Whitelist

A blacklist contains a list of devices that have been marked by the user as

untrustworthy (either explicitly, or after agreeing with an anomaly detection flag) and

should not be trusted regardless of their similarity or score. A whitelist contains a

list of devices to be trusted regardless of the similarity scores. The blacklist/whitelist

module was implemented in the iTrust app to allow a user to override the trust adviser


filters’ scores. The rate of system overriding was one accuracy metric considered in the

evaluation.

4.7.2 Recommendation & Reputation Systems

Several techniques for recommendation systems have been proposed [33, 72], and iTrust is designed to integrate/adopt such (or similar) recommendation systems. iTrust can also bootstrap a recommendation system, since recommendation system scores start to evolve only after initial direct interaction.

Furthermore, a trusted node, over a period of time, may start showing malicious

network behavior (e.g., dropping packets frequently). Reputation systems attempt to

detect such nodes, and can be integrated with iTrust effectively to allow iTrust users to

detect and remove devices that at one time showed high trust potential but later turned

malicious/selfish. One example of such a reputation system is [19]. This reputation system

considers second-hand information, where users maintain reputation only for users they

communicate with (for iTrust it can be all the encountered users). One challenge here

would be to keep the communication costs to a minimum and to detect false advice.

This integrated system can also keep track of incorrect recommendations provided and

failed message routing info.

4.7.3 Contextual & Event Information

The context of an encounter, e.g., event and/or location, is sometimes more important than the encounter statistics per se. An example of such a scenario is a conference that only allows registered users to enter the venue, or a secure building that requires special permits to enter. In these cases, a user may be willing to trust users regardless of the encounter frequency or duration.

For these scenarios, iTrust provides a module that can change the trust recommendation based on the context and the location of the user. Here, context sensing systems [8, 71] or user input can be used to infer the context.


4.7.4 Combined Trust Recommendation

iTrust needs to provide easily understandable information to the user. Providing

scores from independent modules separately may confuse the user. As a first step to

simplify the output, we created a Hybrid Filter, combining the trust filter scores. A similar

idea can be used to combine the scores from all the modules discussed above and

generate a single score of trust for an encountered user. The scores can be combined

using the following:

\[
T(U_j) = \kappa\Bigl(\delta H(U_j) + (1-\delta)\sum_{i=1}^{m}\beta_i R_i(U_j)\Bigr) + (1-\kappa)\,\mathrm{Context}(U_j), \tag{4--9}
\]

where $T(U_j)$ represents the combined trust recommendation for the encountered user $U_j$; it is always between 0 (no trust) and 1 (max trust). $H(U_j)$ is the score from the Hybrid Filter (Sec. 4.3.3). $\beta_i$ represents the weight for each of the other normalized trust-related inputs ($R_i$), such as anomaly detection, recommendation systems and reputation systems, among others. Here $\sum_{i=1}^{m} \beta_i = 1$. The factor $\delta$ decides the combination ratio of the Hybrid Filter and the other trust-related inputs; $\delta$ varies between 0 and 1, so the combined score is also between 0 and 1. $\mathrm{Context}(U_j)$ is the function that gives the context score for trusting $U_j$; its output varies from 0 to 1. The contribution of context to the trust is controlled by the $\kappa$ parameter. If the user ($U_j$) is included in the whitelist, then iTrust does not have to evaluate this user, as it is already trusted. However, if a user exists in the blacklist, he cannot be trusted (trust scores are disregarded).
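To make the combination in Eq. 4–9 concrete, the sketch below evaluates it for one encountered user. It is only an illustration: the class CombinedTrust and its method names are hypothetical, and mapping a whitelisted user to a score of 1 and a blacklisted user to 0 (with the blacklist taking precedence) is an assumption made for the example rather than something the text specifies.

```java
// Hypothetical helper for illustration; not part of the iTrust source.
final class CombinedTrust {

    /**
     * Eq. 4-9: T = kappa * (delta * H + (1 - delta) * sum_i beta[i] * r[i])
     *              + (1 - kappa) * context.
     * All inputs are expected in [0, 1]; beta must sum to 1 when present.
     */
    static double score(double hybrid, double[] beta, double[] r,
                        double delta, double kappa, double context) {
        if (beta.length != r.length) {
            throw new IllegalArgumentException("beta and r must have the same length");
        }
        double betaSum = 0.0;
        double other = 0.0;
        for (int i = 0; i < beta.length; i++) {
            betaSum += beta[i];
            other += beta[i] * r[i];
        }
        if (beta.length > 0 && Math.abs(betaSum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("beta weights must sum to 1");
        }
        return kappa * (delta * hybrid + (1.0 - delta) * other) + (1.0 - kappa) * context;
    }

    /** List overrides: blacklist wins over whitelist (assumed precedence). */
    static double recommend(boolean blacklisted, boolean whitelisted, double combined) {
        if (blacklisted) return 0.0; // trust scores are disregarded
        if (whitelisted) return 1.0; // already trusted, no evaluation needed
        return combined;
    }

    public static void main(String[] args) {
        double t = score(0.8, new double[] {0.5, 0.5}, new double[] {0.6, 0.2}, 0.5, 0.75, 1.0);
        System.out.println(recommend(false, false, t));
    }
}
```

Because every input lies in [0, 1] and the weights form convex combinations, the result stays in [0, 1] without any clamping.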

The challenge lies in finding the correct weights (β, δ, κ) to combine the different inputs. These weights depend on the user preferences and applications. From a survey we conducted, it is clear that there is no single weight scheme that is acceptable to all users (more details in Sec. 4.6.1). One possible way to overcome this challenge is to start the system with standard weights and let it adapt them according to the selections/feedback given by the user.


4.8 Conclusion and Future Work

This work introduces iTrust, an effective encounter based framework for trust

establishment in mobile communities in an efficient, privacy-preserving and resilient

manner. iTrust is driven by trust adviser filters that take advantage of the increased

sensing capabilities of the mobile devices and their close association with users, which

enables them to capture behavioral similarity with encountered devices and assess

levels of trust.

We use four novel encounter based trust adviser filters, based on encounter

frequency, duration, location behavior-vector and behavior-matrix to generate trust

recommendations. iTrust provides scores reflecting the level of trust to aid the user to

choose trustworthy nodes in coordination with personal preferences, location priorities,

contextual information and/or encounter based keys. The calculations are done in a fully distributed fashion, which eliminates the need for any server or trusted third party.

Results of three phases of evaluation reveal that several filters possess high stability

and that trust forms a small world among trusting users. Further, resilience to attack

using anomaly detection achieves less than 10% false positives and 7% false negatives.

Selfishness analysis using trust based epidemic routing shows that it is possible to

efficiently use meaningful, stable trust routing without sacrificing network performance

in DTNs. Ultimately, a series of surveys and participatory experiments consolidate our

belief that users' willingness to trust other devices is highly correlated with behavioral

similarity. Feedback from iTrust application shows that users favor the hybrid filter, the

recommendation of which conforms with 80% of users’ selections.

iTrust has been designed to inspire several potential applications that can be enabled in the future. However, there are a few avenues that require further research. In the future, we plan to address some of these questions, such as handling multiple devices belonging to a user. In addition, addressing issues emerging from MAC address spoofing is part of future research (several crypto-based and non-crypto-based


techniques exist [83]). Future work will include analysis of other filters for measuring

behavioral similarities. We also want to develop and deploy iTrust for popular mobile

platforms and study the effect of its usage on a larger scale.

With the release of the iTrust application, we can measure the encounters to

generate the trust scores. In the future we would like to investigate how the exchange of indirect trust recommendations affects the trust scores (transitivity of trust). The trust

framework presented in this study sits below the application layer in the mobile platform

and can provide trust scores to any requesting application on the mobile device. In

the future we plan to build applications that can benefit from trust scores. An example

of such an application can be crowd-sourcing. In crowd-sourcing applications users

report observations (it may be regarding gas prices in their neighborhood [9], restaurant

reviews, freeway traffic [10]). With the knowledge of who uploaded the data (is this

person in my trust list), the phone can automatically highlight information coming from

trusted sources, which may be more believable.

The knowledge of encounters, and the establishment of trust using them, can be used to provide emergency services. An example application, SOS [78], utilizes iTrust scores to alert trustworthy users in the neighborhood in case of emergency situations. With the inclusion of anomaly detection, iTrust can also generate lists of possible threats in the surroundings (such as the presence of a stalker). In the iTrust application, a user can rate another user on a range of levels from not trusted at all to fully trusted. These levels can also be utilized by applications such as SOS to automatically assess the threat level.

Since iTrust generates trust scores via encounter information, it can also be used to

identify users with similar interests. This information can be used to automatically form

support or meetup groups. We would like to investigate how successful an encounter

based scheme can be in discovering users with similar interests. We would also like

to examine how an encounter measuring system in places such as hospitals can

be used to evaluate patient care (number of doctor visits) and can also be used to


forensically examine and quarantine the spread of pathogens in a hospital by looking at

the encounter history.

iTrust can also be applied to communication scenarios where the existing infrastructure cannot be trusted (an extreme example is a scenario where a section of the population is revolting against a regime and the regime is monitoring all communication). In these cases, if the revolting section of the population has been using iTrust and has established pair-wise security keys, its members can communicate over any medium (including ad hoc networks and DTNs) by encrypting the messages. Here the role of iTrust is to identify the users with whom this user might want to communicate later and thus enable key exchanges with only relevant users. For this scenario and others, we would like to investigate the correlation between trust level and frequency of communication between users.

There is a need to conduct more research in order to understand how trust can be

established in mobile societies. We hope that this research contributes to that effort.


CHAPTER 5
CONCLUSION AND FUTURE WORK

In this work, we propose several techniques to infer complex social relationships

and patterns using network data. We propose novel methods, which use WLAN traces

to classify WLAN users into social groups based on features such as gender and

study-major among others. The work presents a general framework that can be applied

to traces coming from multiple sources. As an example, traces from two university

campuses have been used and gender based grouping classification is performed.

Multiple techniques for grouping users are discussed since each one has slight

advantages in certain scenarios. The study cross-validates the results by comparing

results provided by each of the classification methods.

We uncovered a serious problem in the way WLAN traces are anonymized. We

believe that this kind of attack is possible because WLAN traces have human mobility patterns embedded in them, which can be easily observed by an attacker following the victim. The aim of any privacy protecting technique should be to ensure that even if an attacker

has access to all the publicly available information about a user or a group of users

(but not the mapping between anonymized MAC and real MAC), he should not be

able to reduce the sample size below a number, say K. This K should be a parameter

configurable by the trace releasing authority.

This work proposes iTrust, an effective encounter based framework for trust

establishment in mobile communities in an efficient, privacy-preserving and resilient

manner. iTrust is driven by trust adviser filters that take advantage of the increased

sensing capabilities of the mobile devices and their close association with users,

which enables them to capture behavioral similarity with encountered devices and

assess levels of trust. We use four novel encounter based trust adviser filters, based on

encounter frequency, duration, location behavior-vector and behavior-matrix to generate

trust recommendations. iTrust provides scores reflecting the level of trust to aid the


user to choose trustworthy nodes in coordination with personal preferences, location

priorities, contextual information and/or encounter based keys. The calculations are done in a fully distributed fashion, which eliminates the need for any server or trusted

third party. Results of three phases of evaluation reveal that several filters possess high

stability and that trust forms a small world among trusting users. Further, resilience to

attack using anomaly detection achieves less than 10% false positives and 7% false

negatives. Selfishness analysis using trust based epidemic routing shows that it is

possible to efficiently use meaningful, stable trust routing without sacrificing network

performance in DTNs. Ultimately, a series of surveys and participatory experiments

consolidate our belief that users' willingness to trust other devices is highly correlated

with behavioral similarity. Feedback from iTrust application shows that users favor the

hybrid filter, the recommendation of which conforms with 80% of users’ selections.

In the future, we want to study user behavior from the perspective of buildings and locations. This will allow us to find trends in user behavior based on study-major and building preferences.

The ability to classify users into social groups can allow us to create models for

different groups of users based on usage characteristics. These models can not only

be used to understand different users' characteristics but can also be used to filter users that our proposed schemes could not. We also want to test whether homophily based on encounters exists among different social groups. Another area of research that we would

like to target is to look at the packet or the netflow traces to understand effects of social

group affiliations on browsing characteristics.

For the privacy and anonymity work, we want to design anonymization schemes that are application specific. For example, traces could be anonymized such that routing protocols can be tested on them without any privacy leak. This may allow us to maintain privacy and yet utilize traces for research purposes. One of the directions for Mobile Ad-hoc routing protocol testing would be to anonymize the


traces (such that they are completely privacy preserving) without affecting the encounter probabilities between pairs of users.


APPENDIX A
CODE SNIPPETS FROM ITRUST APPLICATION

In this appendix we will present some sections of iTrust code along with the block

diagram showing the evolution of iTrust application with each release. The first version

of the application was under internal release to the members of our research group.

Based on the feedback we received, more features were added and then it was released

to a group of 30 students. This led to thorough testing of the application. The users complained most about the unavailability of device-to-name mapping and about energy efficiency. These features were added in version 3 of the iTrust application. The approximate evolution of the development, including the features added, is shown in Fig. A-1. The text over the arrows connecting one block to another depicts the features/functionality requested by the users.

In the following sections, we present some of the code snippets which have been

developed for the iTrust application. We hope that these snippets will provide sufficient

implementation details about the iTrust app.

Version 1: 1. Scanner; 2. FE, DE, LV filters; 3. Mark trusted users.

Version 2: 1. Encounters on map; 2. Facility to upload traces on server; 3. Automatic error reporting; 4. Combined Filter; 5. Show currently encountering devices.

Version 3: 1. Energy efficient scanning; 2. MAC address to name lookup; 3. Graphs to show encounter statistics; 4. Users can be trusted on a scale.

Arrow labels (user requests): more encounter info; auto collection of traces; increase battery life; name lookup.

Figure A-1. Evolution of features in the iTrust app based on feedback from users.

A.1 Energy Efficient Scanning

The iTrust application has three algorithms for scanning Bluetooth and WiFi devices. The simplest is an infinite loop with a 100 sec sleep between consecutive executions; in each cycle it scans for both WiFi and Bluetooth devices. The other two algorithms perform energy efficient scanning. The code in List. A.1 illustrates the


algorithm used to decide the scanning interval. The input parameter state is set to zero if any new device is found; otherwise it is set to one.

    // 'factor' and 'MaxThres' are assumed to be fields declared elsewhere in the class.
    public static int[] fibo = {0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89};

    int calSkipFactor(int state) {
        if (state == 0) {
            // A new device was found: go back to scanning every period.
            factor = 1;
        } else if (state == 1) {
            // MaxThres indicates the maximum value allowed in the FIBO series.
            // The length check keeps factor within the bounds of the array.
            if (factor < fibo.length && fibo[factor - 1] < MaxThres) {
                factor++;
            }
        }
        // System.out.println(factor);
        return fibo[factor - 1];
    }

Listing A.1. Function for calculating how many scanning periods to skip. The input parameter state is 0 if a new device is found and 1 otherwise. More about the FIBO algorithm is in Appendix B.

A.2 Calculating LV (Sec. 4.3.2)

    // Requires: import java.util.TreeMap; import android.util.Log;
    // 'locMap', 'score' and 'TAG' are assumed to be fields declared elsewhere in the class.
    public int calLvScore(TreeMap<Integer, EncLocation> userMap,
                          float sumCU2, float sumDU2) { // score is calculated w.r.t. userMap
        float sumCU1 = 0, sumDU1 = 0, prodC = 0, prodD = 0;
        for (Object o : locMap.values()) {
            EncLocation l1 = (EncLocation) o;
            // Log.i(TAG, "calLvScore for user: " + this.Name + " Location id " + u1.locId
            //         + " duration and count " + u1.duration + u1.count);
            EncLocation l2 = userMap.get(l1.getLocId());
            if (l2 == null) {
                Log.e(TAG, "EncUser Check the userMap.. it is missing values present in locMap.. impossible");
                return -1;
            }
            // Accumulate squared norms and dot products for count and duration.
            sumCU1 += (float) l1.getCount() * (float) l1.getCount();
            sumDU1 += (float) l1.getDuration() * (float) l1.getDuration();
            prodC += (float) l1.getCount() * (float) l2.getCount();
            prodD += (float) l1.getDuration() * (float) l2.getDuration();
        }
        // Cosine similarity of the count vectors and of the duration vectors.
        score[2] = (float) (prodC / Math.sqrt(sumCU1 * sumCU2));
        score[3] = (float) (prodD / Math.sqrt(sumDU1 * sumDU2));
        return 0;
    }

Listing A.2. Function that calculates the LV values for a user. ‘userMap’ contains the location visit data for the owner of the device, and ‘locMap’ contains the encounter information, along with the location, for a particular encountered device.


APPENDIX B
ENERGY EFFICIENT DEVICE DISCOVERY

Efficient use of energy is essential for always running mobile applications such

as iTrust. We have looked into some aspects of it and developed an energy efficient

scanner for iTrust as discussed earlier. We use this space to provide more details about

our technique.

B.1 Available Directions

Below are some of the directions that can be utilized to design an energy efficient

device discovery for both Bluetooth and WiFi. In each of the methods the core idea is to

avoid/reduce scanning when no new devices are discovered. The challenge, however, is

not to miss any new devices.

1. Use the current scan response to determine the next scanning time.

2. Use temporal locality: use the weekly pattern to predict the number of encounters per week on a per-hour basis. The system will have to maintain a timetable of 7 days x 24 hours.

3. Use spatial locality: use location information to predict encounters. A new location may need aggressive scans.

Since the scanning process is very similar in Bluetooth and WiFi, any technique developed for Bluetooth can be used for WiFi and vice versa. This still needs to be shown, i.e., that the scanning characteristics of Bluetooth are similar to those of WiFi, so that the same techniques applied to Bluetooth also work with WiFi and skipping scans in Bluetooth has an equivalent effect on WiFi. Details about scanning are explained in [31].

B.2 Evaluations Techniques

Through the deployment of iTrust, we have collected at least one month of traces from 20 users, and some of the users have used iTrust for more than one year. These traces include both Bluetooth and WiFi scans done at 100-second intervals. For the evaluation and comparison of energy efficient methods, we propose to use these traces as the ground truth. An energy efficient algorithm can take these traces as


input and produce an output trace. By comparing the input and output traces and the number of scans performed, we can compare the effectiveness of the energy efficient algorithms.
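The trace-driven evaluation described above can be sketched as a simple replay loop. This is a minimal sketch under stated assumptions, not the evaluation code used for the results: the names `TraceReplay`, `SkipPolicy` and `Result` are hypothetical, each ground-truth slot is assumed to be the set of device identifiers seen by one 100-second scan, a "new device" is assumed to mean one absent from the previous performed scan, and a skipped slot is assumed to repeat the last scan result.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Replays a ground-truth trace through a scan-skipping policy (illustrative names). */
final class TraceReplay {
    /** Hypothetical policy hook: how many scan cycles to skip after the current scan. */
    interface SkipPolicy {
        int cyclesToSkip(boolean newDeviceFound);
    }

    static final class Result {
        final List<Set<String>> outputTrace;
        final int scansPerformed;

        Result(List<Set<String>> outputTrace, int scansPerformed) {
            this.outputTrace = outputTrace;
            this.scansPerformed = scansPerformed;
        }

        /** Percentage of ground-truth scans that the policy avoided. */
        double savingPercent() {
            if (outputTrace.isEmpty()) {
                return 0.0;
            }
            return 100.0 * (outputTrace.size() - scansPerformed) / outputTrace.size();
        }
    }

    static Result replay(List<Set<String>> groundTruth, SkipPolicy policy) {
        List<Set<String>> output = new ArrayList<>();
        Set<String> lastScan = Collections.emptySet();
        int scansPerformed = 0;
        int skipRemaining = 0;
        for (Set<String> slot : groundTruth) {
            if (skipRemaining > 0) {
                // No scan in this slot: assume the last seen devices are still present.
                skipRemaining--;
                output.add(lastScan);
                continue;
            }
            scansPerformed++;
            // Assumption: "new" means not present in the previous performed scan.
            boolean newDeviceFound = !lastScan.containsAll(slot);
            lastScan = slot;
            output.add(slot);
            skipRemaining = policy.cyclesToSkip(newDeviceFound);
        }
        return new Result(output, scansPerformed);
    }
}
```

The output trace has the same length as the input, so per-device occurrence counts in the two traces can be compared directly to estimate accuracy loss, while `savingPercent` gives the fraction of scans avoided.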

B.3 Current Progress

Currently, we have only looked into algorithms that use the current scan response to determine the next scanning time period. These methods generally work by looking at the number of devices (new and already seen) found in the current scan to determine the sleep interval before the next scan. Researchers have developed several algorithms, including STAR [80] and others [31]. Only STAR is evaluated using real traces; the rest use some kind of artificial traces. Hence, to test our proposed algorithms, we have considered only the STAR algorithm for comparison. We propose two algorithms: one based on multiplicative increase and multiplicative decrease (MIMD) (similar to [31]) and another based on the growth rate of the Fibonacci series.

STAR Algorithm: estimates the arrival rate based on the number of new devices detected in the current scan round, and also increases the scan rate if the current time is later than 8 am.

MIMD Algorithm (EE): doubles the current scan time interval if no new device is found (we have an upper bound on the time interval). On detecting a new device, the scan time interval is reduced to the minimum possible period.
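The EE rule just described can be sketched in a few lines. This is an illustrative sketch rather than the iTrust implementation: the class name `MimdScanScheduler`, the constructor parameters, and the use of seconds as the unit are assumptions; the cap is expressed as a multiple of the minimum period, as in the EE4, EE8 and EE16 variants.

```java
/** Sketch of the EE (MIMD) back-off rule; names and units are illustrative. */
final class MimdScanScheduler {
    private final int minIntervalSec; // minimum scan period, e.g. 100 s as in the traces
    private final int maxIntervalSec; // upper bound: capMultiple x minimum (4 for EE4)
    private int currentIntervalSec;

    MimdScanScheduler(int minIntervalSec, int capMultiple) {
        if (minIntervalSec <= 0 || capMultiple < 1) {
            throw new IllegalArgumentException("interval and cap must be positive");
        }
        this.minIntervalSec = minIntervalSec;
        this.maxIntervalSec = minIntervalSec * capMultiple;
        this.currentIntervalSec = minIntervalSec;
    }

    /** Returns the number of seconds to wait before the next scan. */
    int nextInterval(boolean newDeviceFound) {
        if (newDeviceFound) {
            // A new device resets the interval to the minimum period.
            currentIntervalSec = minIntervalSec;
        } else {
            // No new device: double the interval, but never exceed the upper bound.
            currentIntervalSec = Math.min(currentIntervalSec * 2, maxIntervalSec);
        }
        return currentIntervalSec;
    }
}
```

With `new MimdScanScheduler(100, 4)`, three consecutive scans without a new device yield intervals of 200, 400 and 400 seconds, and the next scan that finds a new device brings the interval back to 100 seconds.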

Fibonacci Series based Algorithm (FIBO): uses the Fibonacci series to decide

the number of scan cycles to skip (otherwise similar to EE). The growth is 0, 1, 1, 2, 3, 5,

8, 13, 21 and so on.
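A corresponding sketch for FIBO is given below. It assumes that the numeric suffix (for example, the 4 in FIBO4) clamps the number of skipped cycles and that a new device restarts the series from 0; neither detail is spelled out above, and the class name `FibonacciSkipScheduler` is hypothetical.

```java
/** Sketch of Fibonacci-based scan skipping; the clamping rule is an assumption. */
final class FibonacciSkipScheduler {
    private final int maxSkip; // assumed upper bound on skipped cycles (4 for FIBO4)
    private int current = 0;   // current Fibonacci term
    private int next = 1;      // following Fibonacci term

    FibonacciSkipScheduler(int maxSkip) {
        if (maxSkip < 0) {
            throw new IllegalArgumentException("maxSkip must be non-negative");
        }
        this.maxSkip = maxSkip;
    }

    /** Returns how many scan cycles to skip after the scan that was just performed. */
    int cyclesToSkip(boolean newDeviceFound) {
        if (newDeviceFound) {
            // Restart the series so that scanning becomes frequent again.
            current = 0;
            next = 1;
            return 0;
        }
        int skip = Math.min(current, maxSkip);
        if (current < maxSkip) {
            // Advance 0, 1, 1, 2, 3, 5, ... until the bound is reached.
            int sum = current + next;
            current = next;
            next = sum;
        }
        return skip;
    }
}
```

The series stops advancing once it reaches the bound, which also keeps the integers from overflowing during long idle periods.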

We have compared the above three algorithms for efficiency (saving of scans) and accuracy (not missing any encounter). We measure accuracy by counting the occurrences of each device in the trace produced by each method. To make the efficiency metric independent of accuracy, we assume that during the time interval when no scanning is


performed, each algorithm assumes that the last encountered devices are still being seen (which is why it skipped scanning).

Tab. B-1 shows the accuracy results from running the different energy saving algorithms. We note that our scheme EE4 outperforms both STAR and FIBO and also shows a lower standard deviation. The comparison of efficiency can be found in Tab. B-2. EE16 gives the best savings, followed by EE8 and FIBO16 and then STAR. However, we want an algorithm that is both accurate and efficient at the same time, so we have devised a new metric called the ‘s/e’ ratio. This is the ratio between the efficiency and the accuracy loss. If an energy saving scheme provides more savings and less error, its ‘s/e’ ratio is higher than that of a scheme providing similar savings but worse error rates. To choose an algorithm, one may first decide on the savings needed (based on the current energy budget) and then choose the algorithm that performs best based on ‘s/e’. Tab. B-3 lists the ‘s/e’ ratio for each algorithm.
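The selection procedure can be sketched as follows, assuming that ‘s/e’ is the plain quotient of the average saving and the average accuracy loss. That assumption reproduces the tabulated ratios to within rounding, but the exact averaging used is not stated. The names `SeRatio`, `Candidate` and `pick` are illustrative.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Sketch of choosing an algorithm by s/e ratio under a minimum-saving requirement. */
final class SeRatio {
    static final class Candidate {
        final String name;
        final double saving;       // scan efficiency, as reported per algorithm
        final double accuracyLoss; // accuracy loss, as reported per algorithm

        Candidate(String name, double saving, double accuracyLoss) {
            if (accuracyLoss <= 0) {
                throw new IllegalArgumentException("accuracy loss must be positive");
            }
            this.name = name;
            this.saving = saving;
            this.accuracyLoss = accuracyLoss;
        }

        /** Assumed definition: saving divided by accuracy loss. */
        double se() {
            return saving / accuracyLoss;
        }
    }

    /** Among candidates meeting the required saving, returns the one with the highest s/e. */
    static Optional<Candidate> pick(List<Candidate> candidates, double minSaving) {
        return candidates.stream()
                .filter(c -> c.saving >= minSaving)
                .max(Comparator.comparingDouble(Candidate::se));
    }
}
```

For example, with an average saving of 57.81 and an average accuracy loss of 7.45, EE4 gives 57.81 / 7.45, or approximately 7.76.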

Table B-1. Accuracy loss using traces for 20 users. EE4 means that the upper bound on the scan interval is 4 times the minimum scan period; similarly, in EE8 the upper bound on the skip period is 8 times the minimum. This result used Bluetooth traces only. Lower values are better.

Algo.    Average  Std. Dev.
STAR     9.97     7.49
EE4      7.45     4.38
EE8      10.45    5.84
EE16     13.65    6.81
FIBO4    8.24     3.90
FIBO8    8.58     3.95
FIBO12   10.93    5.42
FIBO16   12.26    6.04

B.3.1 Combining WiFi And Bluetooth Scanning

We now present the results of combining Wi-Fi and Bluetooth scanning with the energy efficient scanner. The scan time interval now depends on the results of both Wi-Fi and Bluetooth scans. Wi-Fi scanning has the following properties that differ from Bluetooth scanning: i. the majority of access points (APs) are stationary, ii. it is possible to miss out


Table B-2. Scan efficiency using traces for 20 users. EE4 means that the upper bound on the scan interval is 4 times the minimum scan period; similarly, in EE8 and EE16 it is 8 and 16 times, respectively. This result used Bluetooth traces only. Higher values are better.

Algo.    Average  Std. Dev.
STAR     64.64    8.22
EE4      57.81    9.56
EE8      66.45    11.56
EE16     70.81    13.12
FIBO4    60.28    11.68
FIBO8    62.79    12.86
FIBO12   64.87    12.80
FIBO16   66.11    14.40

Table B-3. s/e ratio for the STAR, MIMD and FIBO algorithms

Algo.    s/e
STAR     6.49
EE4      7.76
EE8      6.36
EE16     5.19
FIBO4    7.31
FIBO8    7.32
FIBO12   5.93
FIBO16   5.39

on an AP, even though it has strong signal strength at the location, and iii. the range of a WiFi AP is much larger than that of a Bluetooth device. Using these properties of WiFi, we made the matching of scanned APs less stringent: if the number of common APs in the two sets of scans is more than the number of distinct APs found and the number of common APs is greater than 0, we consider it to be the same location (same set of APs seen). This is slightly different from Bluetooth scanning, where exactly the same number of users is needed to consider two scans to give the same result. Also, due to the application requirements of iTrust, we cannot let Wi-Fi and Bluetooth work independently of each other.
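The relaxed matching rule for Wi-Fi scans can be sketched with set operations. The sketch rests on two interpretations that the text leaves open: "distinct" APs are taken to be those seen in only one of the two scans, and the Bluetooth comparison is written as exact set equality rather than equality of counts. The names `ScanMatcher`, `sameWifiLocation` and `sameBluetoothNeighbours` are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of comparing two consecutive scan results; interpretations are assumptions. */
final class ScanMatcher {
    /** Wi-Fi: same location if the common APs outnumber the APs seen in only one scan. */
    static boolean sameWifiLocation(Set<String> previous, Set<String> current) {
        Set<String> common = new HashSet<>(previous);
        common.retainAll(current);
        Set<String> union = new HashSet<>(previous);
        union.addAll(current);
        // Assumption: "distinct" APs are those that appear in exactly one of the scans.
        int distinct = union.size() - common.size();
        return common.size() > 0 && common.size() > distinct;
    }

    /** Bluetooth: stricter rule, assumed here to require identical device sets. */
    static boolean sameBluetoothNeighbours(Set<String> previous, Set<String> current) {
        return previous.equals(current);
    }
}
```

Under this reading, two Wi-Fi scans that share three APs and differ in one are treated as the same location, whereas two scans that share two APs and differ in two are not.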

B.4 Conclusion

We note that the STAR, EE4 and FIBO4 algorithms perform similarly, but EE4 is the clear winner in terms of the ‘s/e’ ratio, followed by FIBO. We also note that the EE and FIBO algorithms are parametric. In the event that accuracy can be sacrificed to save


Table B-4. Combining Wi-Fi and Bluetooth scanning

Algo.    Error   Saving  s/e
STAR     11.47   65.84   5.74
EE4      7.45    54.42   7.30
EE8      10.94   63.03   5.76
EE16     14.66   67.54   4.60
FIBO4    8.23    56.75   6.89
FIBO8    9.09    59.27   6.52
FIBO16   11.76   62.20   5.29

energy, a higher threshold for the scan time interval can be selected, which is not possible in the STAR algorithm; thus they provide an efficiency grade selection mechanism. The current implementation of iTrust uses the EE4 and FIBO4 algorithms for performing energy efficient scanning.


APPENDIX C
USER BEHAVIOR ANALYSIS

Below are results from all the areas we could identify in the universities.

C.0.1 Spatial Distribution

The details are in Tab. C-2 and Tab. C-1.

C.0.2 Temporal Distribution

Details are in Tab. C-4 and Tab. C-3


Table C-1. Spatial Distribution of Users at U2

Area                       Male-Oct07    Female-Oct07  Male-Nov07    Female-Nov07  Male-Mar08    Female-Mar08  Male-Apr08    Female-Apr08
Administration             1152          1140          1096          1136          1254          1327          1507          1656
Agriculture and biology    197           113           180           127           339           264           471           343
Architecture               543           587           464           569           589           651           708           784
Biology                    330           331           339           337           411           435           524           515
Bookstore                  172           125           145           128           247           176           333           264
Economics                  846           591           907           666           965           704           1118          867
Cafeteria food             278           223           263           205           301           248           332           287
Computer Engineering       975           763           930           789           1080          841           1300          1097
Fine Arts                  488           543           410           505           524           610           642           787
Fraternity                 254           84            246           113           268           123           326           184
Health sport human         556           460           495           450           562           598           679           806
Infirmary                  151           124           161           152           203           202           159           148
Communication              406           418           411           475           445           538           545           659
Law                        566           511           558           523           522           495           696           656
Music                      122           105           111           70            230           197           330           308
Philosophy and Stati       94            109           119           124           121           92            152           163
Psychology                 78            71            80            83            87            77            116           106
Recreation food cafeteria  192           273           137           254           154           302           111           263
Social Science             818           815           818           880           858           833           1043          1042
Sorority                   271           969           299           959           331           991           529           1138
Space science and CNS      321           229           282           258           377           321           224           236
Sport recreation           119           121           85            103           131           136           148           131
Theater                    121           139           121           146           131           143           155           211
University Auditorium      43            41            45            37            48            48            61            58
Engineering                1900          895           1784          888           2033          1139          2437          1371
Library                    3767          3749          3415          3667          3556          3903          4497          4968

Table C-2. Spatial Distribution of Users at U1

Area           Male-Feb2006    Female-Feb2006  Male-Oct2006    Female-Oct2006  Male-Feb2007    Female-Feb2007
Accounts       11              5               21              22              15              14
Admin          7               9               13              16              7               10
Chemistry      9               8               9               7
Communication  96              81              115             109             19              48
Economics      37              26              69              58              56              36
Engineering    26              35              37              37              44              31
Law            3               1               5               2               0               3
Medicines      6               3               6               8               7               15
Music          9               11              7               12              4               10
Residence      42              48              53              47              52              49
Social         88              113             143             161             110             128
Sports         16              19              12              21              4               11

Table C-3. Average Duration of Users at U2

Area                       Male-Oct07    Female-Oct07  Male-Nov07    Female-Nov07  Male-Mar08    Female-Mar08  Male-Apr08    Female-Apr08
Administration             2830.54       2674.35       2708.18       2515.91       3005.99       2735.44       2535.49       2756.56
Agriculture and biology    5496.84       2835.95       4605.61       2804.1        6646.08       5334.13       4045.3        3166.2
Architecture               3102.69       4472.13       3819.61       5723.87       3990.28       4247.61       3774.16       4221.17
Biology                    2855.78       3770.86       3259.26       3801.92       2643.61       2385.45       2397.15       2471.11
Bookstore                  1425.17       1717.32       1720.15       737.4         1568.72       1398.41       1238.44       1485.88
Communication              3062.22       2974.99       3240.94       3067.82       2652.34       2693          2830.52       2758.33
Cafeteria food             1322.97       1755.13       1779.43       1332.48       1617.81       1283.37       1655.4        1546.05
Computer Engineering       2226.74       2017.67       2387.85       2070.76       2266.88       1735.37       2613.42       2038.3
Fine Arts                  3723.13       3234.67       4788.84       3702.77       3945.24       3509.83       4439.7        3519.1
Fraternity                 6102.32       3132.25       5627.62       2724.89       6250.4        2825.14       6041.43       2275.41
Health sport human         2021.73       2345.55       1719.18       2161.11       2063.47       1895.39       2083.5        2004.84
Infirmary                  851.93        1702.52       885.8         1224.36       978.22        1114.76       1392.41       1140.61
Journalism                 1895.75       2125.34       2288.49       2179.88       1976.58       1801.81       2143.43       1880.18
Law                        3191.82       3212.97       3430          3614.9        3849.59       3760.19       4555.09       4695.18
Music                      1911.7        1711.29       2565.34       1851.83       1767.29       1167.49       1764.87       1210.22
Philosophy and Stati       4464.41       2168.24       2484.02       2475.97       2923.91       1469          3576.86       2241.14
Psychology                 4317.27       5591.4        3740.35       4841.61       4541.85       3262.46       3415.07       4058.18
Recreation food cafeteria  3346.32       3949.6        3977.89       4763.73       2754.9        2955.62       2528.34       3130.86
Social Science             1513.08       1809.37       1582.34       1858.61       1728.01       1643.11       1563.03       1736.9
Sorority                   3681.18       5881.25       4396.69       5658.94       2035.76       5035.05       2131.98       5171.22
Space science and CNS      2200.49       1492.75       2082.87       1681.06       1819.9        1423.21       3427.35       1895.1
Sport recreation           1489.49       1683.24       2230.28       1600.73       1064.57       1763.8        941.31        1141.93
Theater                    1548.75       1810.34       1791.42       1658.96       2434.57       2035.92       2377.37       2109.12
University Auditorium      3088.45       3131.85       2902.46       4571.47       1362.95       1902.15       1497.05       1852.46
Engineering                2696.45       2361.65       2693.97       2433.75       2664.03       2167.38       2825.3        2486.6
Library                    3953.34       4156.35       4168.5        4531.48       3875.23       4067.77       4388.33       4618.98


Table C-4. Average Duration of Users at U1

Area           Male-Feb2006    Female-Feb2006  Male-Oct2006    Female-Oct2006  Male-Feb2007    Female-Feb2007
Accounts       1108            636             956.65          1114.98         484             1206
Admin          835             1612            346.89          1162.18         536             432
Chemistry      1806            1411            842.24          896             900             720
Communication  1862            2007            1474.38         1417.27         1838            2758
Economics      2044            1587            1826.88         2204.25         1729            2745
Engineering    2797            1834            2341.09         2380.02         2181            782
Law            1545            2096            4776.09         1468.76         91              3528
Medicine       2860            963             1562.8          1940.78         1723            2450
Music          2354            1395            1090.04         686.81          493             534
Residence      2341            1510            1491.92         1185.63         1861            1401
Social         2341            2787            2162.14         2336.53         2008            2243
Sports         1652            2191            636.22          895.02          650             594


APPENDIX D
SURVEY FORM - ITRUST VALIDATION

SURVEY: Encounter-based Trust

Udayan Kumar and Ahmed Helmy

{ukumar, helmy}@cise.ufl.edu, University of Florida, Gainesville.

[Assume your device has enough battery and computation power. Also, your device runs a Bluetooth scanner program that records number, duration and location of encounters with other devices. An encounter occurs when two devices appear in the radio range of each other.]

Please rate (on scale of 1 through 10) your willingness to cooperate with other peer

devices to setup an Ad Hoc or Delay Tolerant Network (DTN).

1. If your device does not have any information about other devices (Strangers)?

2. If your device identifies another device as frequently-encountered (e.g., more than 10 times in the last week)?

3. If your device identifies the other device as encountered for long duration (e.g., for more than 5 hrs total in the last week) but infrequently (e.g., less than 4 times in the last week)

4. If your device identifies the other device as encountered with both high frequency and long durations.

5. If the encounter locations are visited frequently by your device.

6. If the encounter locations have restricted access (e.g., mobicom or NSF).

7. Rate each of the factors that would most affect your willingness to accept a message:

a. Frequency of encounters

b. Duration of encounters

c. Location visited

8. What do you think is the most important combination of the above factors to have you trust others to cooperate in an Ad Hoc or DTN setting? (e.g., do you need all factors (freq, duration, locations) or stats in the restricted locations are enough?)

Other Comments:

Your participation in this study is completely voluntary. There are no anticipated risks, compensation or other direct benefits to you as a participant in this study. You are

free to withdraw your consent to participate and may discontinue your participation in the study at any time without consequence.


REFERENCES

[1] Supplementary Information. Available from: https://sites.google.com/site/confanon/.

[2] Popular baby names, September 2007. Available from: http://www.ssa.gov/OACT/babynames/.

[3] UNC/FORTH: Repository of traces and models for wireless networks, Syslog Dataset #2, August 2007. Available from: http://netserver.ics.forth.gr/datatraces/.

[4] The Passive Measurement and Analysis Project, June 2008. Available from: http://pma.nlanr.net/.

[5] Predict: Protected Repository for the defense of the Infrastructure Against Cyber Attacks, June 2008. Available from: http://www.predict.org.

[6] The Skitter Project, June 2008. Available from: http://www.caida.org/tools/measurement/skitter/.

[7] CRAWDAD, August 2008. Available from: http://crawdad.cs.dartmouth.edu/data.php.

[8] The metrosense project, Feb 2011. Available from: http://metrosense.cs.dartmouth.edu/projects.html.

[9] GasBuddy, 2012. Available from: http://gasbuddy.com/.

[10] Participatory Sensing, 2012. Available from: http://participatorysensing.org/.

[11] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., Vol. 74, pp. 47-97, 2002.

[12] Mark Allman and Vern Paxson. Issues and etiquette concerning use of shared measurement data. In IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 135–140, New York, NY, USA, 2007. ACM.

[13] Eitan Altman. Competition and cooperation between nodes in delay tolerant networks with two hop routing. In NET-COOP, 2009.

[14] Eitan Altman, Arzad A. Kherani, Pietro Michiardi, and Refik Molva. Non-cooperative forwarding in ad-hoc networks. Technical report, PIMRC, 2004.

[15] Fan Bai and Ahmed Helmy. Chapter 1 A SURVEY OF MOBILITY MODELS in Wireless Adhoc Networks. Springer, 2006.


[16] Fan Bai, Narayanan Sadagopan, and Ahmed Helmy. The IMPORTANT framework for analyzing the impact of mobility on performance of routing protocols for adhoc networks. AdHoc Networks Journal, 1:383–403, 2003.

[17] Magdalena Balazinska and Paul Castro. Characterizing mobility and network usage in a corporate wireless local-area network. In ACM MobiSys, 2003.

[18] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, August 2006.

[19] Sonja Buchegger and Jean-Yves Le Boudec. A robust reputation system for mobile ad-hoc networks. In P2PEcon, 2003.

[20] Sonja Buchegger and Jean-Yves Le Boudec. Self-Policing Mobile Ad-Hoc Networks by Reputation. IEEE Comm. Mag., 43(7):101, 2005.

[21] Ronald S. Burt. Decay functions. Social Networks, 22(1):1–28, 2000.

[22] Levente Buttyan et al. Barter-based cooperation in delay-tolerant personal wireless networks. In WoWMoM, 2007.

[23] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 2009.

[24] Chia-Hsin Owen Chen et al. Gangs: gather, authenticate ’n group securely. In MobiCom ’08, 2008.

[25] G. Chen, H. Huang, and M. Kim. Mining frequent and periodic association patterns. Computer Science TR2005-550, Dartmouth College, 2005.

[26] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (second edition ed.), pages 350–355. MIT Press and McGraw-Hill, 2001.

[27] Scott E. Coull, Charles V. Wright, Fabian Monrose, Michael P. Collins, and Michael K. Reiter. Playing devil’s advocate: Inferring sensitive information from anonymized network traces. In Proc. of the 14th Annual Network and Distributed System Security Symposium, pages 35–47, 2007.

[28] Mark E. Crovella and Azer Bestavros. Self-similarity in world wide web traffic: Evidence and possible causes. IEEE/ACM Transactions on Networking, 5:835–846, 1997.

[29] Jon Crowcroft, Richard Gibbens, Frank Kelly, and Sven Ostring. Modelling incentives for collaboration in mobile ad hoc networks. Performance Evaluation, 57, 2004.

[30] Ruby Roy Dholakia et al. The Internet Encyclopedia, chapter Gender and Internet Usage. Wiley, 2003.


[31] Catalin Drula, Cristiana Amza, Franck Rousseau, and Andrzej Duda. Adaptive energy conserving algorithms for neighbor discovery in opportunistic bluetooth networks. IEEE J. Sel. A. Commun., 25(1), January 2007.

[32] A.C. Gallagher and T.H. Chen. Estimating age, gender, and identity using first name priors. In CVPR, 2008.

[33] Elizabeth Gray, Jean-Marc Seigneur, Yong Chen, and Christian Jensen. Trust propagation in small worlds. In Trust management, 2003.

[34] Ben Greenstein, Ramakrishna Gummadi, Jeffrey Pang, Mike Y. Chen, Tadayoshi Kohno, Srinivasan Seshan, and David Wetherall. Can ferris bueller still have his day off? protecting privacy in the wireless era. In HOTOS’07: Proceedings of the 11th USENIX workshop on Hot topics in operating systems, pages 1–6, Berkeley, CA, USA, 2007. USENIX Association.

[35] T. Henderson, D. Kotz, and I. Abyzov. The changing usage of a mature campus-wide wireless network. In ACM MobiCom ’04, September 2004.

[36] W. Hsu and A. Helmy. On modeling user associations in wireless lan traces on university campuses. In WiNMee, 2006.

[37] W. Hsu, T. Spyropoulos, K. Psounis, and A. Helmy. Modeling time-variant user mobility in wireless mobile networks. In Proc. IEEE INFOCOM, May 2007.

[38] Weijen Hsu, Debojyoti Dutta, and Ahmed Helmy. Mining behavioral groups in large wireless lans. In MobiCom, 2007.

[39] Weijen Hsu, Debojyoti Dutta, and Ahmed Helmy. Profile-Cast: Behavior-aware mobile networking. In IEEE WCNC, 2008.

[40] Weijen Hsu and Ahmed Helmy. On nodal encounter patterns in wireless lan traces. In WiNMee, 2006.

[41] Weijen Hsu and Ahmed Helmy. MobiLib, June 2008. Available from: http://nile.cise.ufl.edu/MobiLib/.

[42] Peter Hwang and Willem P. Burgers. Properties of trust: An analytical view. Organizational Behavior and Human Decision Processes, 69(1):67–73, January 1997.

[43] Kalervo Jarvelin and Jaana Kekalainen. Cumulated gain-based evaluation of ir techniques. ACM Trans. Inf. Syst., October 2002.

[44] T. Karagiannis, A. Broido, N. Brownlee, K.C. Claffy, and M. Faloutsos. Is p2p dying or just hiding? [p2p traffic measurement]. GLOBECOM, 2004.

[45] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, March 1990.


[46] David Kotz, Tristan Henderson, and Ilya Abyzov. CRAWDAD data set dartmouth/campus (v. 2007-02-08). Downloaded from http://crawdad.cs.dartmouth.edu/dartmouth/campus, February 2007.

[47] Udayan Kumar and Ahmed Helmy. User classification and feature extraction from wlan traces: A gender-based case study (detailed technical report). Available from: http://www.cise.ufl.edu/~ukumar/techreport-gender.pdf.

[48] Udayan Kumar and Ahmed Helmy. Human behavior and challenges of anonymizing WLAN traces. In IEEE GLOBECOM, 2009.

[49] Udayan Kumar, Nikhil Yadav, and Ahmed Helmy. Gender-based grouping of mobile student societies. In MODUS Workshop, IPSN 2008, http : //www.motorola.com/innovators/ ModusWorkshop/Gender Based.pdf, 2008.

[50] Anthony LaMarca, Yatin Chawathe, Sunny Consolvo, Jeffrey Hightower, Ian Smith, James Scott, Timothy Sohn, James Howard, Jeff Hughes, Fred Potter, Jason Tabert, Pauline Powledge, Gaetano Borriello, and Bill Schilit. Place lab: device positioning using radio beacons in the wild. In Proceedings of the Third international conference on Pervasive Computing, 2005.

[51] N.D. Lane, E. Miluzzo, Hong Lu, D. Peebles, T. Choudhury, and A.T. Campbell. A survey of mobile phone sensing. Communications Magazine, IEEE, 48(9), Sept. 2010.

[52] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. Computational social science. Science, 323(5915):721–723, 2009.

[53] Qinghua Li, Sencun Zhu, and Guohong Cao. Routing in socially selfish delay tolerant networks. In Infocom, 2010.

[54] Kaisen Lin, Aman Kansal, Dimitrios Lymberopoulos, and Feng Zhao. Energy-accuracy trade-off for continuous mobile device location. In MobiSys, 2010.

[55] Yue-Hsun Lin et al. Spate: small-group pki-less authenticated trust establishment. In MobiSys, 2009.

[56] Anders Lindgren, Avri Doria, and Olov Schelen. Probabilistic routing in intermittently connected networks. LNC, pages 239–254, 2004.

[57] Ashwin Machanavajjhala, Johannes Gehrke, and Daniel Kifer. l-diversity: Privacy beyond k-anonymity. pages 24–24, April 2006.

[58] Sergio Marti, T. J. Giuli, Kevin Lai, and Mary Baker. Mitigating routing misbehavior in mobile ad hoc networks. In Mobicom, 2000.


[59] Jonathan M. McCune, Adrian Perrig, and Michael K. Reiter. Seeing Is Believing: using camera phones for human authentication. Int. J. Secur. Netw., 4(1/2):43–56, 2009.

[60] Miller Mcpherson, Lynn S. Lovin, and James M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.

[61] Pietro Michiardi and Refik Molva. Simulation-based analysis of security exposures in mobile ad hoc networks. In European Wireless Conference, 2002.

[62] Greg Minshall. Tcpdpriv, 1996.

[63] Jeffrey C. Mogul and Martin Arlitt. Sc2d: an alternative to trace anonymization. In MineNet ’06: Proceedings of the 2006 SIGCOMM workshop on Mining network data, pages 323–328, New York, NY, USA, 2006. ACM.

[64] D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver. Inside the slammer worm. Security & Privacy, IEEE, 1(4):33–39, July-Aug. 2003.

[65] Mirco Musolesi and Cecilia Mascolo. A community based mobility model for ad hoc network research. In ACM REALMAN, 2006.

[66] Martin O’Connell and Gretchen E Gooding. The use of first names to evaluate reports of gender and its effect on the distribution of married and unmarried couple households. In Population Association of America (PAA) 2006 Annual Meeting, 2006.

[67] A. Panagakis, A. Vaios, and I. Stavrakakis. On the Effects of Cooperation in DTNs. In COMSWARE, 2007.

[68] Jeffrey Pang, Ben Greenstein, Ramakrishna Gummadi, Srinivasan Seshan, and David Wetherall. 802.11 user fingerprinting. In MobiCom ’07: Proceedings of the 13th annual ACM international conference on Mobile computing and networking, pages 99–110, New York, NY, USA, 2007. ACM.

[69] Ruoming Pang, Mark Allman, Vern Paxson, and Jason Lee. The devil and packet trace anonymization. SIGCOMM Comput. Commun. Rev., 36(1):29–38, 2006.

[70] Vikram Srinivasan, Pavan Nuggehalli, Carla F. Chiasserini, and Ramesh R. Rao. Cooperation in wireless ad hoc networks. In IEEE Infocom, 2003.

[71] Kiran K. Rachuri et al. Emotionsense: a mobile phones based adaptive platform for experimental social psychology research. In Ubicomp, 2010.

[72] Glenn Shafer. Perspectives on the theory and practice of belief functions. Int. Journal of Approximate Reasoning, 1990.


[73] Katie Shilton, Jeffrey A. Burke, Debra Estrin, Mark Hansen, and Mani Srivastava. Participatory privacy in urban sensing. In MODUS: International Workshop on Mobile Device and Urban Sensing, 2008.

[74] Douglas C. Sicker, Paul Ohm, and Dirk Grunwald. Legal issues surrounding monitoring during network research. In IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 141–148, New York, NY, USA, 2007. ACM.

[75] Neil Spring, Ratul Mahajan, David Wetherall, and Thomas Anderson. Measuring isp topologies with rocketfuel. IEEE/ACM Trans. Netw., 12(1):2–16, 2004.

[76] Latanya Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557–570, March 2002.

[77] Sapon Tanachaiwiwat and Ahmed Helmy. Worm Propagation and Interaction in Mobile Networks in Handbook on Security and Networks. World Scientific Publishing Co., 2010.

[78] G.S. Thakur, M. Sharma, and A. Helmy. Shield: Social sensing and help in emergency using mobile devices. In GLOBECOM, pages 1–5. IEEE, 2010.

[79] Amin Vahdat and David Becker. Epidemic routing for partially-connected ad hoc networks. Technical report, Duke University, 2000.

[80] Wei Wang, Vikram Srinivasan, and Mehul Motani. Adaptive contact probing mechanisms for delay tolerant applications. In Proceedings of the 13th annual ACM international conference on Mobile computing and networking, MobiCom, 2007.

[81] Jun Xu, Jinliang Fan, Mostafa H. Ammar, and Sue B. Moon. Prefix-preserving ip address anonymization: Measurement-based security evaluation and a new cryptography-based scheme. In Computer Networks, pages 280–289, 2002.

[82] Bojan Zdrnja, Nevil Brownlee, and Duane Wessels. Passive monitoring of dns anomalies. In DIMVA, pages 129–139, 2007.

[83] Kai Zeng, Kannan Govindan, and Prasant Mohapatra. Non-cryptographic authentication and identification in wireless networks. Wireless Communications, 2010.

[84] Sheng Zhong, Jiang Chen, and Richard Yang. Sprite: A Simple, Cheat-Proof, Credit-Based System for Mobile Ad-Hoc Networks. In INFOCOM, 2002.


BIOGRAPHICAL SKETCH

Udayan Kumar received his B.Tech degree from DA-IICT, Gandhinagar, India and

MS degree in Computer Engineering from University of Florida. He started his PhD in

Computer Engineering at University of Florida in 2008. His research interests include

understanding users’ social behavior from network traces and utilizing the behavior

patterns to develop new insights and applications.
