“You Might Also Like:” Privacy Risks of Collaborative Filtering

May 24, 2011 at 6:11 pm

I have a new paper titled “You Might Also Like:” Privacy Risks of Collaborative Filtering with Joe Calandrino, Ann Kilzer, Ed Felten and Vitaly Shmatikov. We developed new “statistical inference” techniques and used them to show how the public outputs of online recommender systems, such as the “You Might Also Like” lists you see on many websites, can reveal individual purchases and preferences. Joe spoke about it at the IEEE S&P conference at Oakland earlier today.

Background: inference and statistical inference. The paper is about techniques for inference. At its core, inference is a simple concept: deducing that some event has occurred from its effect on other observable events or objects, often seemingly unrelated ones. Think Sherlock Holmes, whether something as simple as the idea of a smoking gun, now so well known that it’s a cliché, or something more subtle like the curious incident of the dog in the night-time.

Today, inference has evolved a great deal, and in our data-rich world it often means statistical inference. The detection of extrasolar planets is a good example of deduction from the faintest of clues: a planet orbiting a star makes the star wobble slightly, which changes the star’s velocity relative to the Earth. That relative velocity can in turn be read off from the displacement of the parent star’s spectral lines due to the Doppler effect, and from it the planet’s existence is inferred. Crazy!
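As a back-of-the-envelope illustration (mine, not from the paper, with made-up numbers), the non-relativistic Doppler relation v ≈ c·Δλ/λ is all it takes to turn a tiny shift in a spectral line into a velocity:

    # Illustrative only: infer a star's line-of-sight velocity from the
    # Doppler shift of one spectral line, v ~ c * (delta lambda / lambda).
    C = 299_792_458.0  # speed of light, m/s

    def radial_velocity(rest_wavelength_nm, observed_wavelength_nm):
        shift = observed_wavelength_nm - rest_wavelength_nm
        return C * shift / rest_wavelength_nm

    # A shift of 0.0001 nm in a 500 nm line corresponds to roughly 60 m/s,
    # comparable to the wobble a close-in giant planet induces in its star.
    print(radial_velocity(500.0, 500.0001))  # ~60 m/s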

Web privacy. But back to the paper: what we did was to develop and apply inference techniques in the web context, specifically to recommender systems, in a way that no one had thought of before. As you may have noticed, just about every website publicly shows relationships between related items—products, videos, books, news articles, etc.—and these relationships are derived from purchases or views, which are private information. What if the public listings could be reverse-engineered, so that a user’s purchases could be inferred from them? As the abstract says:

Many commercial websites use recommender systems to help customers locate products and content. Modern recommenders are based on collaborative filtering: they use patterns learned from users’ behavior to make recommendations, usually in the form of related-items lists. The scale and complexity of these systems, along with the fact that their outputs reveal only relationships between items (as opposed to information about users), may suggest that they pose no meaningful privacy risk.

In this paper, we develop algorithms which take a moderate amount of auxiliary information about a customer and infer this customer’s transactions from temporal changes in the public outputs of a recommender system. Our inference attacks are passive and can be carried out by any Internet user.  We evaluate their feasibility using public data from popular websites Hunch, Last.fm, LibraryThing, and Amazon.

The screenshot below shows an example of a related-items list on Amazon. There are up to 100 items in such lists.

Consider a user Alice who has made numerous purchases, some of which she has reviewed publicly. Now she makes a new purchase which she considers sensitive. Because of her purchase, this new item has a nonzero probability of entering the “related items” list of each item she has bought in the past, including the ones she has reviewed publicly. And even if it is already on the related-items list of some of those items, her purchase may improve its rank there. By aggregating dozens or hundreds of these observations, the attacker has a chance of inferring that Alice purchased something, as well as the identity of the item she purchased.

It’s a subtle technique, and the paper has more details than you can shake a stick at if you want to know more.
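To make the mechanics concrete, here is a minimal sketch of the inference idea, not the algorithm from the paper, and with purely hypothetical inputs: the attacker takes before-and-after snapshots of the public related-items lists of Alice’s known purchases, and scores each outside item by how many of those lists it newly entered or rose within.

    # Minimal sketch of the inference idea (not the paper's actual algorithm).
    # lists_before / lists_after map each of Alice's known items to the ranked
    # related-items list observed at two points in time. All names are hypothetical.
    from collections import Counter

    def score_candidates(known_items, lists_before, lists_after):
        scores = Counter()
        for item in known_items:
            rank_before = {x: i for i, x in enumerate(lists_before.get(item, []))}
            rank_after = {x: i for i, x in enumerate(lists_after.get(item, []))}
            for candidate, r_after in rank_after.items():
                if candidate in known_items:
                    continue
                # Count lists on which the candidate newly appeared or improved its rank.
                if candidate not in rank_before or r_after < rank_before[candidate]:
                    scores[candidate] += 1
        return scores.most_common()

    # An item that climbs on many of Alice's items' lists around the same time
    # ends up at the top of the candidate ranking.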

We evaluated the attacks we developed against several websites of a diverse nature. Numerically, our best results are against Hunch, a recommendation and personalization website. There is a tradeoff between the number of inferences and their accuracy. When optimized for accuracy, our algorithm inferred a third of the test users’ secret answers to Hunch questions with no error. Conversely, if asked to predict the secret answer to every secret question, the algorithm had an accuracy of around 80%.

Impact. It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different. On sites other than Hunch—Hunch had an API that provided exact numerical correlations between pairs of items—our attacks worked only on a small proportion of users, although that is sufficient to demonstrate the concept. (Hunch has since eliminated this feature of the API, for reasons unrelated to our research.) We also found that users of larger sites are much safer, because the statistical aggregates are computed from a larger set of users.
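A toy calculation (mine, not from the paper, with hypothetical numbers) shows why the size of the user base matters: a single transaction moves a pairwise aggregate far more when it is computed over fewer users.

    # One new co-purchase shifts a pairwise co-occurrence rate noticeably on a
    # small site, but is lost in the noise on a large one.
    def cooccurrence_rate(joint_buyers, total_users):
        return joint_buyers / total_users

    print(cooccurrence_rate(4, 1_000) - cooccurrence_rate(3, 1_000))                  # 0.001
    print(cooccurrence_rate(3_001, 1_000_000) - cooccurrence_rate(3_000, 1_000_000))  # 1e-06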

But here’s why we think this paper is important:

  • Our attack applies to a wide variety of sites—essentially every site with an online catalog of some sort. While we discuss various ways to mitigate the attack in the paper, there is no bulletproof “fix.”
  • It undermines the widely accepted dichotomy between “personally identifiable” individual records and “safe,” large-scale, aggregate statistics. Furthermore, it demonstrates that the dynamics of aggregate outputs (i.e., their variation with time) constitute a new vector for privacy breaches. Dynamic behavior of high-dimensional aggregates like item similarity lists falls beyond the protections offered by any existing privacy technology, including differential privacy.
  • It underscores the fact that modern systems have vast “surfaces” for attacks on privacy, making it difficult to protect fine-grained information about their users. Unintentional leaks of private information are akin to side-channel attacks: it is very hard to enumerate all aspects of the system’s publicly observable behavior which may reveal information about individual users.

That last point is especially interesting to me. We’re leaving digital breadcrumbs online all the time, whether we like it or not. And while algorithms to piece these trails together might seem sophisticated today, they will probably look mundane in a decade or two if history is any indication. The conversation around privacy has always centered around the assumption that we can build technological tools to give users—at least informed users—control over what they reveal about themselves, but our work suggests that there might be fundamental limits to those tools.

See also: Joe Calandrino’s post about this paper.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.



4 Comments

  • 1. shri  |  May 29, 2011 at 4:20 pm

    Interesting attack. But how would an adversary get access to the recommender system’s output for a particular user? Usually the access to the output of the recommender system for a user is limited, isn’t it? E.g., Netflix will show recommendations for a user only when that user logs in.

    • 2. Arvind Narayanan  |  May 29, 2011 at 6:18 pm

      Shri,
      The adversary uses the output that’s shown to all users, not a particular user. See the screenshot showing “customers who bought this item also bought”—these are simply lists of items similar to other items, and aren’t specific to a user. The paper has more details.

  • 3. ab  |  June 15, 2011 at 3:07 am

    While the results are interesting, I believe the paper crosses the fine line between concern and needless paranoia. Functioning in society as a normal human being automatically exposes a person to certain risks. The risk from such web recommender systems is extremely small.

    Privacy is sadly being studied the way cryptography is (and by people who study cryptography), but while a perfect, provably secure communication scheme is possible, it is impossible to have a functioning society with complete privacy. Privacy is more of a social issue, and should be dealt with by regulations and laws. Humanity has advanced because we shared information.
    Synthetic data sets and other privacy-preserving techniques have very limited utility. This kind of research only promotes tin-foil-hat behavior.

    I sometimes wonder if your research is motivated by your libertarian beliefs.

    • 4. Arvind Narayanan  |  June 15, 2011 at 3:25 am

      1. We explicitly mention in the paper, as I did in this blog post, that

      “It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different.”

      The point of this paper was to study and quantify the information leakage from today’s complex systems, not to argue that they should be reengineered. I think we made that clear, and your charge of “paranoia” is unjustified.

      Incidentally, I went the extra mile and refused all media inquiries about this paper to avoid the possibility that it might be reported as a privacy threat.

      2. “Functioning in society as a normal human being automatically exposes a person to certain risks.”

      That is precisely the conclusion I’m drawing in the last paragraph! Science is often about quantifying what one intuitively suspects.

      3. My beliefs have nothing to do with this paper and your accusation is ridiculous.

