Posts tagged ‘author recognition’
In an abstract sense, re-identifying a record in an anonymized collection using a piece of auxiliary information is nothing more than identifying which of N vectors best matches a given vector. As such, it is related to many well-studied problems from other areas of information science: the record linkage problem in statistics and census studies, the search problem in information retrieval, the classification problem in machine learning, and finally, biometric identification. Noticing inter-disciplinary connections is often very illuminating and sometimes leads to breakthroughs, but I fear that in the case of re-identification, these connections have done more harm than good.
Record linkage and k-anonymity. Sweeney‘s well-known experiment with health records was essentially an exercise in record linkage. The re-identification technique used was the simplest possible — a database JOIN. The unfortunate consequence was that for many years, the anonymization problem was overgeneralized based on that single experiment. In particular, it led to the development of two related and heavily flawed notions: k-anonymity and quasi-identifier.
The main problem with k-anonymity it is that it attempts avoid privacy breaches via purely syntactic manipulations to the data, without any model for reasoning about the ‘adversary’ or attacker. A future post will analyze the limitations of k-anonymity in more detail. ‘Quasi-identifier’ is a notion that arises from attempting to see some attributes (such as ZIP code) but not others (such as tastes and behavior) as contributing to re-identifiability. However, the major lesson from the re-identification papers of the last few years has been that any information at all about a person can be potentially used to aid re-identification.
Movie ratings and noise. Let’s move on to other connections that turned out to be red herrings. Prior to our Netflix paper, Frankowski et al. studied de-anonymization of users via movie ratings collected as part of the GroupLens research project. Their algorithm achieved some success, but failed when noise was added to the auxiliary information. I believe this to be because the authors modeled re-identification as a search problem (I have no way to know if that was their mental model, but the algorithms they came up with seem inspired by the search literature.)
What does it mean to view re-identification as a search problem? A user’s anonymized movie preference record is treated as the collection of words on a web page, and the auxiliary information (another record of movie preferences, from a different database) is treated as a list of search terms. The reason this approach fails is that in the movie context, users typically enter distinct, albeit overlapping, sets of information into different sites or sources. This leads to a great deal of ‘noise’ that the algorithm must deal with. While noise in web pages is of course an issue for web search, noise in the search terms themselves is not. That explains why search algorithms come up short when applied to re-identification.
The robustness against noise was the key distinguishing element that made the re-identification attack in the Netflix paper stand out from most previous work. Any re-identification attack that goes beyond Sweeney-style demographic attributes must incorporate this as a key feature. ‘Fuzzy’ matching is tricky, and there is no universal algorithm that can be used. Rather, it needs to be tailored to the type of dataset based on an understanding of human behavior.
Hope for authorship recognition. Now for my final example. I’m collaborating with other researchers, including John Bethencourt and Emil Stefanov, on some (currently exploratory) investigations into authorship recognition (see my post on De-anonymizing the Internet). We’ve been wondering why progress in existing papers seems to hit a wall at around 100 authors, and how we can break past this limit and carry out de-anonymization on a truly Internet scale. My conjecture is that most previous papers hit the wall because they framed authorship recognition as a classification problem, which is probably the right model for forensics applications. For breaking Internet anonymity, however, this model is not appropriate.
In a de-anonymization problem, if you only succeed for some fraction of the authors, but you do so in a verifiable way, i.e, your algorithm either says “Here is the identity of X” or “I am unable to de-anonymize X”, that’s great. In a classification problem, that’s not acceptable. Further, in de-anonymization, if we can reduce the set of candidate identities for X from a million to (say) 10, that’s fantastic. In a classification problem, that’s a 90% error rate.
These may seem like minor differences, but they radically affect the variety of features that we are able to use. We can throw in a whole lot of features that only work for some authors but not for others. This is why I believe that Internet-scale text de-anonymization is fundamentally possible, although it will only work for a subset of users that cannot be predicted beforehand.
Re-identification science. Paul Ohm refers to what I and other researchers do as “re-identification science.” While this is flattering, I don’t think we’ve done enough to deserve the badge. But we need to change that, because efforts to understand re-identification algorithms by reducing them to known paradigms have been unsuccessful, as I have shown in this post. Among other things, we need to better understand the theoretical limits of anonymization and to extract the common principles underlying the more complex re-identification techniques developed in recent years.
Thanks to Vitaly Shmatikov for reviewing an earlier draft of this post.
I’ve been thinking about this problem for quite a while: is it possible to de-anonymize text that is posted anonymously on the Internet by matching the writing style with other Web pages/posts where the authorship is known? I’ve discussed this with many privacy researchers but until recently never written anything down. When someone asked essentially the same question on Hacker News, I barfed up a stream of thought on the subject :-) Here it is, lightly edited.
Each one of us has a writing style that is idiosyncratic enough to have a unique “fingerprint”. However, it is an open question whether it can be efficiently extracted.
The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
The beginning of modern, rigorous research in this field was by Mosteller and Wallace in 1964: they identified the author of the disputed Federalist papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other “coups” for writing-style de-anonymization are the identification of the author of Primary Colors, as well as the unabomber (his brother recognized his style, it wasn’t done by statistical/computational means).
The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then. I’m familiar with the results from those papers; the curious thing is that they stop at corpuses of a couple hundred authors or so — i.e, identifying one anonymous poster out of say 200, rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need is always logarithmic in the potential number of authors, and so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.
The other interesting thing is that the papers are fixated with ‘topic-free’ identification, where the texts aren’t about a particular topic, making the problem harder. The good news is that when you’re doing this Internet-scale, nobody is stopping you from using topic information, making it a lot easier.
So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you’d need fairly long texts, perhaps a page or two. It’s doubtful that anything can be done with a single average-length email.
It think it’s likely that one can build a tool to protect anonymity by taking a chunk of writing and removing your fingerprint from it, but it will need a lot of work, and will probably lead to a cat-and-mouse game between improved de-anonymization and obfuscation techniques. Note the caveats, however: most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts — rants about employers, accounts from cheating spouses, political dissent, etc. — that have already been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.
If you’re interested in working on this problem–either writing style analysis for breaking anonymity or obfuscation techniques for protecting anonymity–drop me a line.