“You Might Also Like:” Privacy Risks of Collaborative Filtering
I have a new paper titled “You Might Also Like:” Privacy Risks of Collaborative Filtering with Joe Calandrino, Ann Kilzer, Ed Felten and Vitaly Shmatikov. We developed new “statistical inference” techniques and used them to show how the public outputs of online recommender systems, such as the “You Might Also Like” lists you see on many websites, can reveal individual purchases and preferences. Joe spoke about it at the IEEE S&P conference at Oakland earlier today.
Background: inference and statistical inference. The paper is about techniques for inference. At its core, inference is a simple concept, and is about deducing that some event has occured based on its effect on other observable events or objects, often seemingly unrelated. Think Sherlock Holmes, whether something simple such as the idea of a smoking gun, now so well known that it’s a cliché, or something more subtle like the curious incident of the dog in the night time.
Today, inference has evolved a great deal, and in our data-rich world, inference often means statistical inference. Detection of extrasolar planets is a good example of making deductions from the faintest clues: A planet orbiting a star makes the star wobble slightly, which affects the velocity of the star with respect to the Earth. And this relative velocity can be deduced from the displacement in the parent star’s spectral lines due to the Doppler effect, thus inferring the existence of a planet. Crazy!
Web privacy. But back to the paper: what we did was to develop and apply inference techniques in the web context, specifically recommender systems, in a way that no one had thought of before. As you may have noticed, just about every website publicly shows relationships between related items—products, videos, books, news articles, etc.— and these relationships are derived from purchases or views, which are private information. What if the public listings could be reverse engineered, so that we can infer a user’s purchases from them? As the abstract says:
Many commercial websites use recommender systems to help customers locate products and content. Modern recommenders are based on collaborative filtering: they use patterns learned from users’ behavior to make recommendations, usually in the form of related-items lists. The scale and complexity of these systems, along with the fact that their outputs reveal only relationships between items (as opposed to information about users), may suggest that they pose no meaningful privacy risk.
In this paper, we develop algorithms which take a moderate amount of auxiliary information about a customer and infer this customer’s transactions from temporal changes in the public outputs of a recommender system. Our inference attacks are passive and can be carried out by any Internet user. We evaluate their feasibility using public data from popular websites Hunch, Last.fm, LibraryThing, and Amazon.
Consider a user Alice who’s made numerous purchases, some of which she has reviewed publicly. Now she makes a new purchase which she considers sensitive. But this new item, because of her purchasing it, has a nonzero probability of entering the “related items” list of each of the items she has purchased in the past, including the ones she has reviewed publicly. And even if it is already in the related-items list of some of those items, it might improve its rank on those lists because of her purchase. By aggregating dozens or hundreds of these observations, the attacker has a chance of inferring that Alice purchased something, as well as the identity of the item she purchased.
It’s a subtle technique, and the paper has more details than you can shake a stick at if you want to know more.
We evaluated the attacks we developed against several websites of a diverse nature. Numerically, our best results are against Hunch, a recommendation and personalization website. There is a tradeoff between the number of inferences and their accuracy. When optimized for accuracy, our algorithm inferred a third of the test users’ secret answers to Hunch questions with no error. Conversely, if asked to predict the secret answer to every secret question, the algorithm had an accuracy of around 80%.
Impact. It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different. On sites other than Hunch—Hunch had an API that provided exact numerical correlations between pairs of items—our attacks worked only on a small proportion of users, although it is sufficient to demonstrate the concept. (Hunch has since eliminated this feature of the API, for reasons unrelated to our research.) We also found that users of larger sites are much safer, because the statistical aggregates are computed from a larger set of users.
But here’s why we think this paper is important:
- Our attack applies to a wide variety of sites—essentially every site with an online catalog of some sort. While we discuss various ways to mitigate the attack in the paper, there is no bulletproof “fix.”
- It undermines the widely accepted dichotomy between “personally identiﬁable” individual records and “safe,” large-scale, aggregate statistics. Furthermore, it demonstrates that the dynamics of aggregate outputs (i.e., their variation with time) constitute a new vector for privacy breaches. Dynamic behavior of high-dimensional aggregates like item similarity lists falls beyond the protections offered by any existing privacy technology, including differential privacy.
- It underscores the fact that modern systems have vast “surfaces” for attacks on privacy, making it difﬁcult to protect ﬁne-grained information about their users. Unintentional leaks of private information are akin to side-channel attacks: it is very hard to enumerate all aspects of the system’s publicly observable behavior which may reveal information about individual users.
That last point is especially interesting to me. We’re leaving digital breadcrumbs online all the time, whether we like it or not. And while algorithms to piece these trails together might seem sophisticated today, they will probably look mundane in a decade or two if history is any indication. The conversation around privacy has always centered around the assumption that we can build technological tools to give users—at least informed users—control over what they reveal about themselves, but our work suggests that there might be fundamental limits to those tools.
See also: Joe Calandrino’s post about this paper.