Posts tagged ‘netflix’
Today is a sad day. It is also a day of hope.
It is a sad day because the second Netflix challenge had to be cancelled. We never thought it would come to this. One of us has publicly referred to the dampening of research as the “worst possible outcome” of privacy studies. As researchers, we are true believers in the power of data to benefit mankind.
We published the initial draft of our de-anonymization study just two weeks after the dataset for the first Netflix Prize became public. Since we had the math to back up our claims, we assumed that lessons would be learned, and that if there were to be a second data release, it would either involve only customers who opted in, or a privacy-preserving data analysis mechanism. That was three and a half years ago.
Instead, you brushed off our claims, calling them “absolutely without merit,” among other things. It has taken negative publicity and an FTC investigation to stop things from getting worse. Some may make the argument that even if the privacy of some of your customers is violated, the benefit to mankind outweighs it, but the “greater good” argument is a very dangerous one. And so here we are.
We were pleasantly surprised to read the plain, unobfuscated language in the blog post announcing the cancellation of the second contest. We hope that this signals a change in your outlook with respect to privacy. We are happy to see that you plan to “continue to explore ways to collaborate with the research community.”
Running something like the Netflix Prize competition without compromising privacy is a hard problem, and you need the help of privacy researchers to do it right. Fortunately, there has been a great deal of research on “differential privacy,” some of it specific to recommender systems. But there are practical challenges, and overcoming them will likely require setting up an online system for data analysis rather than an “anonymize and release” approach.
Data privacy researchers will be happy to work with you rather than against you. We believe that this can be a mutually beneficial collaboration. We need someone with actual data and an actual data-mining goal in order to validate our ideas. You will be able to move forward with the next competition, and just as importantly, it will enable you to become a leader in privacy-preserving data analysis. One potential outcome could be an enterprise-ready system which would be useful to any company or organization that outsources analysis of sensitive customer data.
It’s not often that a moral imperative aligns with business incentives. We hope that you will take advantage of this opportunity.
Arvind Narayanan and Vitaly Shmatikov
For background, see our paper and FAQ.
When trying to find someone in an ‘anonymous’ collection of data, two major questions need to be answered:
- Which is the best match among all the data records, given what I know about the person I’m looking for?
- Is the match meaningful, and not an unrelated record that co-incidentally happens to be similar?
The first question is conceptually simple: one needs to come up with a “scoring function” that compares two sets of data and produces a numerical match score. However, this needs to be done with domain-specific knowledge. In the case of the Netflix dataset, for instance, the scoring function incorporated our intuitions about how long a person might take to review a movie after watching it.
The second question is harder, put perhaps somewhat surprisingly, can be done in a domain-independent way. This is the notion of “eccentricity” that we came up with, and it might be of independent interest. During my talks, I could see that there was some confusion and misunderstanding time and again; hence this post.
The concept behind eccentricity is to measure how much the matching record “stands out” from the crowd. One way to do that would be do measure the difference between the top score and the mean score. (As a multiple of the standard deviation. You always need to divide by the standard deviation to derive a dimensionless quantity.)
The problem with this intuitive measure is that in a large enough collection of data, there will always be entries that have a high enough matching score, purely by chance. To be able to model what scores you’d expect by chance, you need to know everything about how people rate movies (or browse the web, or the equivalent in your dataset) and the correlations between them. And that’s clearly impossible.
The trick is to look at the difference between the best match and the second best match as a multiple of the standard deviation. If the scores are distributed according to an exponential distribution (that’s what you’d expect in this case by pure chance, not a Gaussian), then the difference between the top two matches also follows an exponential distribution. That’s a nifty little lemma.
So, if the best match is 10 standard deviations away from the second best, it argues very strongly against the “null hypothesis,” which is that the match occured by fluke. Visually, the results of applying eccentricity are immediately apparent:
Perhaps the reason that eccentricity is at first counterintuitive is that it looks at only two items and throws away the rest of the numbers. But this is common in statistical estimation. Here’s a puzzle that might help: given n samples a1, a2, … an from the interval [0, x], where x is unknown, what is the best predictor of x?
–Select whitespace below for answer–
Answer: max(ai) * (n+1)/n.
In other words, throw away all but one of the samples! Counterintuitive?
David Molnar pointed me to an article in the Shidler Journal of Law that prominently cites the Netflix dataset de-anonymization paper. I’m very happy to see this; when we wrote our paper, we were hoping to see the legal community analyze the implications of our work for privacy laws. As the article notes:
Re-identification of anonymized data with individual consumers may expose companies to increased liability. If data is re-identified, this may be due to the failure of companies to take reasonable precautions to protect consumer data. In addition, companies may violate their own privacy policies by releasing anonymous information to third parties that can be easily re-identified with individual users.
New lines will need to be drawn defining what is acceptable data-release policy, and in a way that takes into account the actual re-identification risk instead of relying on syntactic crutches such as removing “personally identifiable” information. Perhaps there will need to be a constant process of evaluating and responding to continuing improvements in re-identification algorithms.
Perhaps the ability of third parties to discover information about an individual’s movie rankings is not too disturbing, as movie rankings are not generally considered to be sensitive information. But because these same techniques can lead to the re-identification of data, far greater privacy concerns are implicated.
Indeed, since we wrote our paper, there have been several high profile cases in the news or in the courts where our re-identification techniques can be used to cause much more sensitive privacy breaches, including the Google-Viacom lawsuit involving Youtube viewer logs and the targeted advertising companies Phorm and Nebuad. While the lessons of our paper have begun to propagate “downstream” to the realms of law, advocacy and policy, it has come too late to make a difference in the above examples.
Part of the reason why I started this blog is in the hope of accelerating this process by reaching out to people outside the computer science community. While our papers might be couched in technical language, the results of our research are general enough to be easily accessible to a broad audience, and I hope that this blog will become a central point for disseminating information more broadly.