Transcript
Page 1: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Privacy amp Social Media

Lisa Singh PhD Department of Computer Science Georgetown University

Outline

bull Our world on the Internet bull Data privacy in a public profile world bull Methods for determining our web footprints

bull Taking control of our web identities

Our presence on the Internet and social media

72 Billion People in the World

35 Billion Have a Mobile

Device

50

3 Billion Use the Internet 42

2 Billion Use Social Media

29

Data so much datahellip

Users share 70 billion pieces of content each month on Facebook

190 million tweets are sent per day

65 hours of video are uploaded to YouTube every minute

Image from httpwwwpl aybuzzcomjaylam10 which-social-media-fits-your-personality

Privacy settings and social media

bull 25 of Facebook users do not bother with any privacy settings (velocitydigitalcouk 2013)

bull 37 of Facebook users have used the sitersquos privacy toolsto customize how much information apps are allowed tosee (Consumer reports 2012)

bull 40 of teen Facebook users DO NOT set their Facebook profiles to private (friends only) (Pew Study 2013) ndash 71 post their school name ndash 71 post the city or town where they live ndash 53 post their email address ndash 20 post their cell phone number

Consequences of Over-sharing

bull Identity theft bull Online and physical stalking bull Blackmailing bull Negative employment consequences

bull Enabling of snoopers

Data Privacy Expectations

bull We should expect data privacy

bull We should expect freedom from unauthorized use of our data

bull We should expectfreedom from data intrusion

How informative linkable or sensitive is your public profile ndash your web footprint

Gay

Georgetown University

Washington DC

Software Developer

John Smith

John Smith

Divorced

Spanish-speaking

Department of Defense

Republican

Catholic

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 2: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Outline

bull Our world on the Internet bull Data privacy in a public profile world bull Methods for determining our web footprints

bull Taking control of our web identities

Our presence on the Internet and social media

72 Billion People in the World

35 Billion Have a Mobile

Device

50

3 Billion Use the Internet 42

2 Billion Use Social Media

29

Data so much datahellip

Users share 70 billion pieces of content each month on Facebook

190 million tweets are sent per day

65 hours of video are uploaded to YouTube every minute

Image from httpwwwpl aybuzzcomjaylam10 which-social-media-fits-your-personality

Privacy settings and social media

bull 25 of Facebook users do not bother with any privacy settings (velocitydigitalcouk 2013)

bull 37 of Facebook users have used the sitersquos privacy toolsto customize how much information apps are allowed tosee (Consumer reports 2012)

bull 40 of teen Facebook users DO NOT set their Facebook profiles to private (friends only) (Pew Study 2013) ndash 71 post their school name ndash 71 post the city or town where they live ndash 53 post their email address ndash 20 post their cell phone number

Consequences of Over-sharing

bull Identity theft bull Online and physical stalking bull Blackmailing bull Negative employment consequences

bull Enabling of snoopers

Data Privacy Expectations

bull We should expect data privacy

bull We should expect freedom from unauthorized use of our data

bull We should expectfreedom from data intrusion

How informative linkable or sensitive is your public profile ndash your web footprint

Gay

Georgetown University

Washington DC

Software Developer

John Smith

John Smith

Divorced

Spanish-speaking

Department of Defense

Republican

Catholic

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 3: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Our presence on the Internet and social media

72 Billion People in the World

35 Billion Have a Mobile

Device

50

3 Billion Use the Internet 42

2 Billion Use Social Media

29

Data so much datahellip

Users share 70 billion pieces of content each month on Facebook

190 million tweets are sent per day

65 hours of video are uploaded to YouTube every minute

Image from httpwwwpl aybuzzcomjaylam10 which-social-media-fits-your-personality

Privacy settings and social media

bull 25 of Facebook users do not bother with any privacy settings (velocitydigitalcouk 2013)

bull 37 of Facebook users have used the sitersquos privacy toolsto customize how much information apps are allowed tosee (Consumer reports 2012)

bull 40 of teen Facebook users DO NOT set their Facebook profiles to private (friends only) (Pew Study 2013) ndash 71 post their school name ndash 71 post the city or town where they live ndash 53 post their email address ndash 20 post their cell phone number

Consequences of Over-sharing

bull Identity theft bull Online and physical stalking bull Blackmailing bull Negative employment consequences

bull Enabling of snoopers

Data Privacy Expectations

bull We should expect data privacy

bull We should expect freedom from unauthorized use of our data

bull We should expectfreedom from data intrusion

How informative linkable or sensitive is your public profile ndash your web footprint

Gay

Georgetown University

Washington DC

Software Developer

John Smith

John Smith

Divorced

Spanish-speaking

Department of Defense

Republican

Catholic

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 4: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Data so much datahellip

Users share 70 billion pieces of content each month on Facebook

190 million tweets are sent per day

65 hours of video are uploaded to YouTube every minute

Image from httpwwwpl aybuzzcomjaylam10 which-social-media-fits-your-personality

Privacy settings and social media

bull 25 of Facebook users do not bother with any privacy settings (velocitydigitalcouk 2013)

bull 37 of Facebook users have used the sitersquos privacy toolsto customize how much information apps are allowed tosee (Consumer reports 2012)

bull 40 of teen Facebook users DO NOT set their Facebook profiles to private (friends only) (Pew Study 2013) ndash 71 post their school name ndash 71 post the city or town where they live ndash 53 post their email address ndash 20 post their cell phone number

Consequences of Over-sharing

bull Identity theft bull Online and physical stalking bull Blackmailing bull Negative employment consequences

bull Enabling of snoopers

Data Privacy Expectations

bull We should expect data privacy

bull We should expect freedom from unauthorized use of our data

bull We should expectfreedom from data intrusion

How informative linkable or sensitive is your public profile ndash your web footprint

Gay

Georgetown University

Washington DC

Software Developer

John Smith

John Smith

Divorced

Spanish-speaking

Department of Defense

Republican

Catholic

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 5: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Privacy settings and social media

bull 25 of Facebook users do not bother with any privacy settings (velocitydigitalcouk 2013)

bull 37 of Facebook users have used the sitersquos privacy toolsto customize how much information apps are allowed tosee (Consumer reports 2012)

bull 40 of teen Facebook users DO NOT set their Facebook profiles to private (friends only) (Pew Study 2013) ndash 71 post their school name ndash 71 post the city or town where they live ndash 53 post their email address ndash 20 post their cell phone number

Consequences of Over-sharing

bull Identity theft bull Online and physical stalking bull Blackmailing bull Negative employment consequences

bull Enabling of snoopers

Data Privacy Expectations

bull We should expect data privacy

bull We should expect freedom from unauthorized use of our data

bull We should expectfreedom from data intrusion

How informative linkable or sensitive is your public profile ndash your web footprint

Gay

Georgetown University

Washington DC

Software Developer

John Smith

John Smith

Divorced

Spanish-speaking

Department of Defense

Republican

Catholic

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 6: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Consequences of Over-sharing

bull Identity theft bull Online and physical stalking bull Blackmailing bull Negative employment consequences

bull Enabling of snoopers

Data Privacy Expectations

bull We should expect data privacy

bull We should expect freedom from unauthorized use of our data

bull We should expectfreedom from data intrusion

How informative linkable or sensitive is your public profile ndash your web footprint

Gay

Georgetown University

Washington DC

Software Developer

John Smith

John Smith

Divorced

Spanish-speaking

Department of Defense

Republican

Catholic

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 7: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Data Privacy Expectations

bull We should expect data privacy

bull We should expect freedom from unauthorized use of our data

bull We should expectfreedom from data intrusion

How informative linkable or sensitive is your public profile ndash your web footprint

Gay

Georgetown University

Washington DC

Software Developer

John Smith

John Smith

Divorced

Spanish-speaking

Department of Defense

Republican

Catholic

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 8: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

How informative linkable or sensitive is your public profile ndash your web footprint

Gay

Georgetown University

Washington DC

Software Developer

John Smith

John Smith

Divorced

Spanish-speaking

Department of Defense

Republican

Catholic

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 9: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Your name

Lisa Singh Micah Sherr

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 10: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 11: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Linking dataFacebook

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghFavorite Sports Team SeahawksReligion Atheist

Google+

First Name SallyLast Name SmithGender FemaleLocation Georgetown Occupation DentistRelationship Status MarriedZip code 22033

Adversaryrsquos Beliefs

First Name SallyLast Name SmithGender FemaleLocation GeorgetownHometown PittsburghOccupation DentistFavorite Sports Team RedskinsReligion AtheistRelationship Status MarriedZip Code 22033

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 12: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

What about friends

Startinguser

Listofnamesoffriends

Listofnamesoffriendsforgiven

user

match =numberoverlappingfriendsbetweenusers

site1 site2

[Ramachandran etal2012]

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 13: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

WebFootprint

A1A2A3A4A5A6

A1A2

JohnDoe

A3A4

JohnDoe

A5A6

JohnDoe

Really linking data

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 14: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Shared Public Attributes

Google+

bull Companybull Occupationbull Educationbull Locationbull Birthdatebull Relationshipstatus

bull Genderbull GraduationYear

LinkedIn

bull Companybull Locationbull Educationbull Emailbull Occupationbull Skillsbull Industrybull Websitebull Languages

FourSquare

bull Facebookidbull Twitterhandle

bull Emailbull Genderbull Locationbull Phonenumber

bull Relationshipstatus

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 15: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

What do group memberships tell us

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 16: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

What about tweets

bull A special wish for a special girl HappyBirthdaybull I love Starbuck MangoTeaLemonadebull Go Bears

[Singhetal2015]

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 17: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

bull Birthdaybull Genderbull Addressbull Educationbull Hobbies

bull Skillsbull Titlebull Industrybull Educationbull Experience

bull Thoughtsbull Ideasbull Interestsbull HobbiesTowhatdegreecansiteleveldatabe

leveragedtodeterminetheundisclosedattributesofauser

What about the population

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 18: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Methodology

bull Sample user profiles from media sites

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

ab rarr cacd rarr ead rarr b

bcdf rarr a

Public Profiles

Step 1Subpopulation

Sampling

Step 2Inference Engine

Construction

UserProfile

InferenceModel

HiddenAttribute-Values

Step 3Determination of Hidden

Attribute-Values

InferenceEngine

InferenceModel

Fig 1 Attribute inference methodology The adversary samples arandom subpopulation of site-level public profiles (left) constructssite-level inference rules and models using the sampled public profiles(center) and applies the inference engine to a targeted userrsquos publicprofile to predict a hidden attribute-value (right)

the value of attribute Aj when restricted(i Aj) = ) Theadversaryrsquos goal is to discover information about i which isnot in the public profile P i

To more formally define the attacker we introduce a truthfunction truth(i) that returns a set id vi1 vi2 vim suchthat (1) id = Pi

ID and (2) either vij = Pij if Pi

j = orotherwise vij is the correct value of attribute Aj for user iIntuitively truth(i) is the complete set of correct values foruser i for attributes ID A1 Am and can contain valuesthat are not in D Hence the adversaryrsquos goal is to infer theset truth(i) P i Note that this includes both attributes thatare restricted using the sitersquos permission system as well asattributes that are unknown (ie null) to the site Toward thatend in this paper we attempt to infer single attribute-valuesusing data available to the adversary

IV ATTRIBUTE INFERENCE METHODOLOGY

Even though social networking sites often include privacysettings that allow a user to control which attributes in her pro-file are disclosed to the public based on the previous literaturepresented in Section II we make the observation that removingsensitive attributes from a public profile is insufficient to ensurethat those attributes are not easily discoverable In this paperwe are interested in understanding how a public attribute orpublic attribute combination can be used to infer hidden valuesTherefore we analyze how effective frequent patterns of asitersquos subpopulation are for inferring sensitive attributes thatare hidden by a particular user

We develop an attribute inference methodology for deter-mining non-published attributes about a targeted user Ourmethodology is based on the premise that an attacker mayexplore the site in question and then use this backgroundknowledge to make inferences about a particular userrsquos non-published attributes Figure 1 shows the three steps in our highlevel attribute inference methodology subpopulation samplinginference engine construction and determination of hiddenattribute-values

A Subpopulation Sampling

The first step of our methodology is to randomly samplea subpopulation of profiles from a site containing a databaseD More formally our subpopulation D0 has a representativesample of the attribute-value pairs of interest from D (ieD0 D) In practice an adversary can trivially obtain asubpopulation sample by using a sitersquos web interface or API

B Site-based Inference Engine Construction and Determina-tion of Hidden Attribute-Values

There are many methods for building a site-based inferenceengine We begin by extending two previously proposed ap-proaches one that uses Latent Dirichlet Allocation (LDA) [4]and another based on a modified Naıve Bayes method [14]We then propose a new approach based on multi-attributeassociation rule mining Finally we consider an ensembleapproach that incorporates all of the different techniques intothe site-level inference engine Construction of the inferenceengine is done offline and infrequently for a particular sitetherefore the cost of generating inference models or rules isnot significantly burdensome to the adversary

To clarify the different methods we will use a toy examplebased on user data presented in Figure 2 In this example Dcontains four attributes id gender relationship and industryThe adversary is interested in determining User 6rsquos industryattribute-value In this scenario User 6 has decided to not makethis attribute-value public Using the site API the adversaryobtains D0 a subset of D containing the public profiles forUsers 1-5 The adversary will now generate his inferenceengine using these public profiles The remainder of thissubsection describes each of the methods that can serve as thebasis for the inference engine that the adversary will build

LDA Nearest Neighbor Inference Chaabane et al usethe Latent Dirichlet Allocation (LDA) generative model toextract semantic links between interest attribute-values [3] Ourvariation of their method is as follows

Each profile idk vk1 vk2 vkq in the subpopulation D0

consists of the attribute-value pairs for some subset of at-tributes in D We begin by considering a particular attributeAq Each attribute has a domain containing a set of values|Aq| = v1 vm In LDA terms we consider eachattribute-value a word For each attribute-value vk we obtainits related Wikipedia categories to enhance the value sets Wefirst retrieve the top relevant article describing the attributevk from a free text index built by the Lemur Search Engine2

over the entire Wikipedia stub contained in the ClueWeb09collection3 The indexrsquos size is approximately 1GB for thecompressed documents Next from each of these articleswe use Wikipedia as an ontology and obtain all the cate-gories and general categories for the top n articles using atoolkit developed by [10] For instance a value ldquosomeonelike yourdquo has top Wikipedia categories ldquoAdele (singer) songsrdquoand ldquoSingles certified septuple platinum by the AustralianRecording Industry Associationrdquo These categories help tocreate the hidden topical structure that will be inferred usingthe observed attribute-values Intuitively the distribution of

2httpwwwlemurprojectorg3httplemurprojectorgclueweb09

bull Makeinferencesusingtheinferenceenginebull Useuserprofilestoconstructaninferenceenginecontainingasetofinferencerules

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 19: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

0

3

6

9

12

15Inferencegain

Inferencegain

What can be inferred from the population

[Mooreetal2013]

LinkedIndataset91150publicprofiles12attributesperprofile

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 20: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Web Footprinting

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 21: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Experiments for Understanding Public Profiles

Aboutme - personal website hosting site Each user can make a custom

webpage about themselves Can list links to their social

media profiles on multiple websites

Using their API we collected 124497 peoples information -gt Ground Truth

21

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 22: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Creating Web Footprints Using Google+ Foursquare LinkedIn Profiles

[Singhetal2015]

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 23: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

23

Synonyms can be found

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 24: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Dbpedia

Synonyms

Meronym

24

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 25: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Using an Ontology

25

Approximately8000attributeswerematchedupfromtheontology

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 26: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Taking Control of Our Web Identity and Data

1 Keep your public profile professional2 Change all your social media account settings that have

personal information on them from public to private3 Choose your friends wisely ndash add them selectively4 Join groups related to your professional interests5 Make it difficult for automated tools to link your accounts

eg use different account user names share different information etc

6 Install ad blockers to reduce data about your click through habits

7 Set your browser to not accept cookies from sites that you have not visited before

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 27: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

The world around us

DATAFICATION

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 28: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Data Ethics

bull Regulationndash We need to hold companies to higher standards

bull Data ethics standardsndash We need discussion debate and possibly a new discipline

bull Catalog of personal datandash Individuals should be able to see correct andor remove

data companies have about them

[Singh2016]

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 29: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

Final Thoughts

bull There is a cultural acceptance of sharing private data publicly bull This is a problem - I have shown you different techniques for

generating web footprints It is too easybull We need new ways to help users understand what data can be

determined about them and help them take control of their information

bull We need to pause and debate online privacy and ethical uses of large-scale human behavioral data

bull We need to develop guidelines and regulations that protect users

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 30: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

We need to take back control of our data

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 31: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

ReferencesJ Zhu S Zhang L Singh H Yang and M Sherr Generating Risk Reduction Recommendations to Decrease Identifiability of Public Online Profiles under submission

A Hian-Cheong L Singh M Sherr H Yang Semantics and Public Information Exposure Detection invited

L Singh H Yang M Sherr A Hian-Cheong K Tian J Zhu and S Zhang Public Information Exposure Detection Helping Users Understand Their Web Footprints International Conference on Advances in Social Networks Analysis and Mining (ASONAM) Paris France EEEACM 2015

L Singh H Yang M Sherr Y Wei A Hian-Cheong K Tian J Zhu S Zhang T Vaidya and E Asgarli Helping Users Understand Their Web Footprints International Conference on World Wide Web - Companion Proceedings World Wide Web (WWW) Florence Italy Poster Paper 2015

W B Moore Y Wei A Orshefsky M Sherr L Singh H Yang Understanding Site-Based Inference Potential for Identifying Hidden Attributes International Conference on Privacy Security Risk and Trust Alexandria VA IEEE Computer Society 2013

J Ferro L Singh M Sherr Identifying individual vulnerability based on public data International Conference on Privacy Security and Trust Tarragona Catalonia Spain IEEE Computer Society 2013

F Nagle L Singh and A Gkoulalas-Divanis EWNI Efficient Anonymization of Vulnerable Individuals in Social Networks Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD) Kuala Lumpur Malaysia Springer 2012

A Ramachandran L Singh E Porter and F Nagle Exploring re-identification risks in public domains Conference on Privacy Security and Trust (PST) IEEE Computer Society 2012

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 32: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

The Team amp Support

bull Faculty ndash Lisa Singh Micah Sherr Grace Hui Yan

bull Students amp Researchers ndash Rob Churchill Kristen Skillman Kevin Tian Sicong Zhang

Yanan Zhu

bull Alumni ndash Aditi Ramachandran Frank Nagle John Ferro Yifang Wei Brad

Moore Andrew Hian-Cheong Janet Zhu

Support National Science Foundation

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1

Page 33: Privacy and Social Media - NISTOur presence on the Internet and social media 7.2 Billion People in theWorld 3.5 Billion Have a Mobile Device 50% 3 Billion Use the Internet 42% 2 Billion

5 Reasons to Join Our Program

1 WeareresearchactiveandprovidefullfinancialRAsupportforPhDstudentsfor5year

2 WehavefullandpartialscholarshipsforMasterrsquos students3 Ourcoursesspanappliedandtheoreticalareasofcomputerscienceaswell

asinterdisciplinaryareaslikedatascience4 Wehaveexceptionaljobplacementintoptechfirmsnationallabsand

governmentagencies5 Wehaveastrongcommunityamongstudentsandfaculty

NeedmoreinformationWebsite httpcsgeorgetowneduGraduateDirectorLisaSingh(singhcsgeorgetownedu)

ApplicationdeadlineApril1


Top Related