
Page 1: “You might also like: …”

“You might also like: …”
Privacy risks of collaborative filtering

Yuval Madar, June 2012

Based on a paper by J.A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten & V. Shmatikov

Page 2: The setting: Recommendation systems

Recommendation systems suggest items, and sometimes other users, to a user’s liking, deduced from the user’s previous purchases.

Numerous examples:
◦ Amazon, iTunes, CNN, Last.fm, Pandora, Netflix, Youtube, Hunch, Hulu, LibraryThing, IMDb and many others.

Page 3: Recommendation types

◦ User to item – “You might also like…”
◦ Item to item – Similar items list
◦ User to user – Another customer with common interests

Page 4: Recommendation methods

Content-based filtering
◦ Based on a priori similarity between items; recommendations are derived from a user’s own history.
◦ Doesn’t pose a privacy threat.
Collaborative filtering
◦ Based on correlations between other users’ purchases.
◦ Our attacks will target this type of system.
Hybrid
◦ A system employing both filtering techniques.

Page 5: Attack model

The data the recommendation system uses is modeled as a matrix, where rows correspond to users and columns correspond to items.

Some auxiliary information on the target user is available (a subset of the target user’s transaction history).

An attack is successful if it allows the attacker to learn transactions that are not part of its auxiliary information.
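A minimal sketch of this matrix model in Python (all names and data here are illustrative, not from the paper):

```python
import numpy as np

# Toy user-item matrix: rows = users, columns = items.
# matrix[u, i] = 1 if user u bought item i, else 0.
users = ["alice", "bob", "carol", "target"]
items = ["book_a", "book_b", "book_c", "book_d"]

matrix = np.array([
    [1, 0, 1, 0],  # alice
    [0, 1, 1, 0],  # bob
    [1, 1, 0, 1],  # carol
    [1, 0, 1, 1],  # target
])

# Auxiliary information: a known subset of the target's purchases.
aux = {"book_a", "book_c"}

# The attack succeeds if it reveals any target purchase outside aux
# (here, book_d).
```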

Page 6: Sources for auxiliary information

◦ Public user ratings and comments on products
◦ Shared transactions (via Facebook or other mediums)
◦ Discussions on 3rd-party sites
◦ Favorite books in a Facebook profile
◦ Non-online interactions (with friends, neighbors, coworkers, etc.)
◦ Other sources…

Page 7: Attack 1 – Related-items list inference

Input:
◦ A set of target items T and a set of auxiliary items A.

Observe the related-items lists of the items in A until an item in T appears, or moves up.

If a target item appears in enough related-items lists at the same time, the attacker may infer it was bought by the target user.

Note 1 – Scoring may be far more complex, since different items in A are correlated (books that belong to a single series, bundle discounts, etc.).

Note 2 – It is preferable that A consist of obscure and uncommon items, to increase the effect of the target user’s choices on their related-items lists.
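A minimal sketch of this monitoring loop, assuming a hypothetical fetch_related_items(item) call that returns the site’s current related-items list in rank order (every name and parameter here is illustrative):

```python
import time

def related_items_attack(aux_items, target_items, fetch_related_items,
                         rounds=10, delay_sec=3600, threshold=3):
    """Watch the related-items lists of the auxiliary items; when a target
    item appears or moves up in enough lists at once, infer a purchase."""
    # Last observed rank of each target item in each auxiliary list
    # (None = not currently on the list).
    last_rank = {(a, t): None for a in aux_items for t in target_items}
    inferred = set()
    for _ in range(rounds):
        votes = {t: 0 for t in target_items}
        for a in aux_items:
            ranked = fetch_related_items(a)  # hypothetical site API
            for t in target_items:
                rank = ranked.index(t) if t in ranked else None
                prev = last_rank[(a, t)]
                appeared = prev is None and rank is not None
                moved_up = prev is not None and rank is not None and rank < prev
                if appeared or moved_up:
                    votes[t] += 1
                last_rank[(a, t)] = rank
        # Simultaneous movement across many lists implicates the target user.
        inferred.update(t for t, v in votes.items() if v >= threshold)
        time.sleep(delay_sec)
    return inferred
```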

Page 8: Attack 2 – Covariance matrix inference

On some sites, the covariance matrix describing the correlations between items on the site is exposed to the users. (Hunch is one such website.)

Similarly to Attack 1, the attacker watches for increases in the covariance between the auxiliary items and the target items.

Note 1 – Updates to different matrix cells are asynchronous.

Note 2 – Inference probability improves if the auxiliary items are user-unique (no other user bought all the auxiliary items). This is more likely if some of them are unpopular, or if there are enough of them.
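The same idea as a sketch against an exposed covariance matrix; fetch_covariance(a, t) is a hypothetical call returning the currently published cell for an item pair:

```python
def covariance_attack(aux_items, target_items, fetch_covariance, threshold=3):
    """Snapshot the covariance between auxiliary and target items, then
    infer a purchase when enough cells rise together afterwards."""
    before = {(a, t): fetch_covariance(a, t)
              for a in aux_items for t in target_items}
    # ... wait for the site to publish updated matrix cells; updates may
    # be asynchronous, so in practice each cell is polled separately ...
    inferred = []
    for t in target_items:
        increased = sum(1 for a in aux_items
                        if fetch_covariance(a, t) > before[(a, t)])
        if increased >= threshold:
            inferred.append(t)
    return inferred
```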

Page 9: Attack 3 – kNN recommender system inference

System model
◦ For each user, the system finds the k users most similar to it, and ranks items purchased by them by total number of sales.

Active attack
◦ Create k dummy users, each buying all known auxiliary items.
◦ With high probability, the k dummy users and the target user will be clustered together. (This requires an auxiliary item list of size logarithmic in the total number of users; in practice, 8 items were found to be enough for most sites.)
◦ In that case, the recommendations to the dummy users will consist of transactions of the target user previously unknown to the attacker.

Note – The attack is more feasible in a system where user interaction with items does not involve spending money.
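A toy simulation of this sybil attack against a simplified kNN recommender; the recommender, the data, and all names are illustrative assumptions, not the paper’s code:

```python
import numpy as np

def knn_recommend(matrix, user, k):
    """Simplified kNN recommender: score items by how many of the user's
    k nearest neighbors (by purchase-vector overlap) bought them."""
    sims = matrix @ matrix[user]          # overlap with every user
    sims[user] = -1                       # exclude the user itself
    neighbors = np.argsort(sims)[-k:]     # the k most similar users
    scores = matrix[neighbors].sum(axis=0)
    scores[matrix[user] == 1] = 0         # don't recommend owned items
    return np.argsort(scores)[::-1]       # items, best-scoring first

# Toy data: 5 real users x 6 items; user 0 is the target.
rng = np.random.default_rng(0)
matrix = rng.integers(0, 2, size=(5, 6))
aux = [0, 2]                              # target's known (auxiliary) items
matrix[0, aux] = 1
matrix[1:, 0] = 0                         # make item 0 unpopular, so the
                                          # auxiliary set is user-unique
k = 3

# Sybil attack: add k dummy users who buy exactly the auxiliary items.
dummies = np.zeros((k, 6), dtype=int)
dummies[:, aux] = 1
matrix = np.vstack([matrix, dummies])

# Each dummy's nearest neighbors are now the other dummies plus the
# target, so its top recommendations leak the target's other purchases.
print(knn_recommend(matrix, matrix.shape[0] - 1, k))
```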

Page 10: Attack metrics

The main parameters for evaluating an inference attack are:
◦ Yield – how many inferences are produced.
◦ Accuracy – how likely each inference is to be correct.

Yield-accuracy tradeoff – a stricter algorithm rejects less probable inferences, gaining accuracy at the cost of yield.
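One plausible way to make these two numbers concrete (an illustrative formalization, not the paper’s exact definitions):

```python
def attack_metrics(inferences, true_transactions):
    """Yield: number of inferences produced.
    Accuracy: fraction of those inferences that are correct."""
    if not inferences:
        return 0, 0.0
    correct = sum(1 for item in inferences if item in true_transactions)
    return len(inferences), correct / len(inferences)

# Example: 4 inferences, 3 of them real purchases -> yield 4, accuracy 0.75.
print(attack_metrics({"a", "b", "c", "d"}, {"a", "b", "c", "x"}))
```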

Page 11: Actual attacks

The paper further discusses specific attacks performed against:
◦ Hunch
◦ LibraryThing
◦ Last.fm
◦ Amazon

It measures the accuracy and yield of these attacks, in some instances arriving at impressive tradeoff figures (such as 70% accuracy at 100% yield on Hunch).

Page 12: Differential privacy

◦ Not discussed in the paper.
◦ Achieved in other papers for static recommendation databases.
◦ Remains an open problem for dynamic systems (which all real-world examples are).

Page 13: Decreasing the privacy risk

◦ Limited-length related-items lists – the first elements of such lists have low sensitivity to single purchases.
◦ Factoring item popularity into update frequency – less popular items are more sensitive to single purchases; batching their purchases together decreases the information leak (see the sketch below).
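A minimal sketch of the batching mitigation, assuming a hypothetical recompute(item) hook that refreshes an item’s public related-items data (all names and thresholds are illustrative):

```python
from collections import defaultdict

class BatchedUpdater:
    """Delay public list/matrix updates for unpopular items until a batch
    of purchases accumulates, so no single purchase is visible alone."""

    def __init__(self, recompute, popularity, batch_size=10, popular_cutoff=1000):
        self.recompute = recompute      # hypothetical hook: refresh item data
        self.popularity = popularity    # item -> total historical sales
        self.batch_size = batch_size
        self.pending = defaultdict(int)

    def record_purchase(self, item):
        if self.popularity.get(item, 0) >= self.popular_cutoff:
            # Popular item: a single purchase hides in the crowd.
            self.recompute(item)
        else:
            # Unpopular item: only update once a whole batch accumulates.
            self.pending[item] += 1
            if self.pending[item] >= self.batch_size:
                self.recompute(item)
                self.pending[item] = 0
```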

Page 14: Decreasing the privacy risk

◦ Limiting the data access rate – prevents large-scale privacy attacks, though it lowers utility and may be circumvented using a botnet (a generic rate-limiter sketch follows below).
◦ User opt-out – a privacy-conscious user may decide to opt out of recommender systems entirely (at a clear cost in utility).
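A generic token-bucket rate limiter as one way to realize this; the paper doesn’t prescribe a mechanism, and the names and limits here are illustrative:

```python
import time

class TokenBucket:
    """Per-client rate limiter for related-items / covariance API calls."""

    def __init__(self, rate_per_sec=1.0, burst=10):
        self.rate = rate_per_sec        # tokens regained per second
        self.capacity = burst           # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # reject: this client is querying too fast
```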

Page 15: Conclusion

◦ We saw passive and active attacks on recommender systems that use auxiliary information on a target user’s purchases, allowing the attacker to infer undisclosed private transactions.
◦ Increased user awareness is required.
◦ Several methods were suggested to decrease the information leaked by these systems.

Page 16: “You might also like: …”

Questions?