Posts tagged ‘k-anonymity’
In an abstract sense, re-identifying a record in an anonymized collection using a piece of auxiliary information is nothing more than identifying which of N vectors best matches a given vector. As such, it is related to many well-studied problems from other areas of information science: the record linkage problem in statistics and census studies, the search problem in information retrieval, the classification problem in machine learning, and finally, biometric identification. Noticing inter-disciplinary connections is often very illuminating and sometimes leads to breakthroughs, but I fear that in the case of re-identification, these connections have done more harm than good.
Record linkage and k-anonymity. Sweeney‘s well-known experiment with health records was essentially an exercise in record linkage. The re-identification technique used was the simplest possible — a database JOIN. The unfortunate consequence was that for many years, the anonymization problem was overgeneralized based on that single experiment. In particular, it led to the development of two related and heavily flawed notions: k-anonymity and quasi-identifier.
The main problem with k-anonymity it is that it attempts avoid privacy breaches via purely syntactic manipulations to the data, without any model for reasoning about the ‘adversary’ or attacker. A future post will analyze the limitations of k-anonymity in more detail. ‘Quasi-identifier’ is a notion that arises from attempting to see some attributes (such as ZIP code) but not others (such as tastes and behavior) as contributing to re-identifiability. However, the major lesson from the re-identification papers of the last few years has been that any information at all about a person can be potentially used to aid re-identification.
Movie ratings and noise. Let’s move on to other connections that turned out to be red herrings. Prior to our Netflix paper, Frankowski et al. studied de-anonymization of users via movie ratings collected as part of the GroupLens research project. Their algorithm achieved some success, but failed when noise was added to the auxiliary information. I believe this to be because the authors modeled re-identification as a search problem (I have no way to know if that was their mental model, but the algorithms they came up with seem inspired by the search literature.)
What does it mean to view re-identification as a search problem? A user’s anonymized movie preference record is treated as the collection of words on a web page, and the auxiliary information (another record of movie preferences, from a different database) is treated as a list of search terms. The reason this approach fails is that in the movie context, users typically enter distinct, albeit overlapping, sets of information into different sites or sources. This leads to a great deal of ‘noise’ that the algorithm must deal with. While noise in web pages is of course an issue for web search, noise in the search terms themselves is not. That explains why search algorithms come up short when applied to re-identification.
The robustness against noise was the key distinguishing element that made the re-identification attack in the Netflix paper stand out from most previous work. Any re-identification attack that goes beyond Sweeney-style demographic attributes must incorporate this as a key feature. ‘Fuzzy’ matching is tricky, and there is no universal algorithm that can be used. Rather, it needs to be tailored to the type of dataset based on an understanding of human behavior.
Hope for authorship recognition. Now for my final example. I’m collaborating with other researchers, including John Bethencourt and Emil Stefanov, on some (currently exploratory) investigations into authorship recognition (see my post on De-anonymizing the Internet). We’ve been wondering why progress in existing papers seems to hit a wall at around 100 authors, and how we can break past this limit and carry out de-anonymization on a truly Internet scale. My conjecture is that most previous papers hit the wall because they framed authorship recognition as a classification problem, which is probably the right model for forensics applications. For breaking Internet anonymity, however, this model is not appropriate.
In a de-anonymization problem, if you only succeed for some fraction of the authors, but you do so in a verifiable way, i.e, your algorithm either says “Here is the identity of X” or “I am unable to de-anonymize X”, that’s great. In a classification problem, that’s not acceptable. Further, in de-anonymization, if we can reduce the set of candidate identities for X from a million to (say) 10, that’s fantastic. In a classification problem, that’s a 90% error rate.
These may seem like minor differences, but they radically affect the variety of features that we are able to use. We can throw in a whole lot of features that only work for some authors but not for others. This is why I believe that Internet-scale text de-anonymization is fundamentally possible, although it will only work for a subset of users that cannot be predicted beforehand.
Re-identification science. Paul Ohm refers to what I and other researchers do as “re-identification science.” While this is flattering, I don’t think we’ve done enough to deserve the badge. But we need to change that, because efforts to understand re-identification algorithms by reducing them to known paradigms have been unsuccessful, as I have shown in this post. Among other things, we need to better understand the theoretical limits of anonymization and to extract the common principles underlying the more complex re-identification techniques developed in recent years.
Thanks to Vitaly Shmatikov for reviewing an earlier draft of this post.