Posts tagged ‘theory’
In an abstract sense, re-identifying a record in an anonymized collection using a piece of auxiliary information is nothing more than identifying which of N vectors best matches a given vector. As such, it is related to many well-studied problems from other areas of information science: the record linkage problem in statistics and census studies, the search problem in information retrieval, the classification problem in machine learning, and finally, biometric identification. Noticing inter-disciplinary connections is often very illuminating and sometimes leads to breakthroughs, but I fear that in the case of re-identification, these connections have done more harm than good.
Record linkage and k-anonymity. Sweeney‘s well-known experiment with health records was essentially an exercise in record linkage. The re-identification technique used was the simplest possible — a database JOIN. The unfortunate consequence was that for many years, the anonymization problem was overgeneralized based on that single experiment. In particular, it led to the development of two related and heavily flawed notions: k-anonymity and quasi-identifier.
The main problem with k-anonymity it is that it attempts avoid privacy breaches via purely syntactic manipulations to the data, without any model for reasoning about the ‘adversary’ or attacker. A future post will analyze the limitations of k-anonymity in more detail. ‘Quasi-identifier’ is a notion that arises from attempting to see some attributes (such as ZIP code) but not others (such as tastes and behavior) as contributing to re-identifiability. However, the major lesson from the re-identification papers of the last few years has been that any information at all about a person can be potentially used to aid re-identification.
Movie ratings and noise. Let’s move on to other connections that turned out to be red herrings. Prior to our Netflix paper, Frankowski et al. studied de-anonymization of users via movie ratings collected as part of the GroupLens research project. Their algorithm achieved some success, but failed when noise was added to the auxiliary information. I believe this to be because the authors modeled re-identification as a search problem (I have no way to know if that was their mental model, but the algorithms they came up with seem inspired by the search literature.)
What does it mean to view re-identification as a search problem? A user’s anonymized movie preference record is treated as the collection of words on a web page, and the auxiliary information (another record of movie preferences, from a different database) is treated as a list of search terms. The reason this approach fails is that in the movie context, users typically enter distinct, albeit overlapping, sets of information into different sites or sources. This leads to a great deal of ‘noise’ that the algorithm must deal with. While noise in web pages is of course an issue for web search, noise in the search terms themselves is not. That explains why search algorithms come up short when applied to re-identification.
The robustness against noise was the key distinguishing element that made the re-identification attack in the Netflix paper stand out from most previous work. Any re-identification attack that goes beyond Sweeney-style demographic attributes must incorporate this as a key feature. ‘Fuzzy’ matching is tricky, and there is no universal algorithm that can be used. Rather, it needs to be tailored to the type of dataset based on an understanding of human behavior.
Hope for authorship recognition. Now for my final example. I’m collaborating with other researchers, including John Bethencourt and Emil Stefanov, on some (currently exploratory) investigations into authorship recognition (see my post on De-anonymizing the Internet). We’ve been wondering why progress in existing papers seems to hit a wall at around 100 authors, and how we can break past this limit and carry out de-anonymization on a truly Internet scale. My conjecture is that most previous papers hit the wall because they framed authorship recognition as a classification problem, which is probably the right model for forensics applications. For breaking Internet anonymity, however, this model is not appropriate.
In a de-anonymization problem, if you only succeed for some fraction of the authors, but you do so in a verifiable way, i.e, your algorithm either says “Here is the identity of X” or “I am unable to de-anonymize X”, that’s great. In a classification problem, that’s not acceptable. Further, in de-anonymization, if we can reduce the set of candidate identities for X from a million to (say) 10, that’s fantastic. In a classification problem, that’s a 90% error rate.
These may seem like minor differences, but they radically affect the variety of features that we are able to use. We can throw in a whole lot of features that only work for some authors but not for others. This is why I believe that Internet-scale text de-anonymization is fundamentally possible, although it will only work for a subset of users that cannot be predicted beforehand.
Re-identification science. Paul Ohm refers to what I and other researchers do as “re-identification science.” While this is flattering, I don’t think we’ve done enough to deserve the badge. But we need to change that, because efforts to understand re-identification algorithms by reducing them to known paradigms have been unsuccessful, as I have shown in this post. Among other things, we need to better understand the theoretical limits of anonymization and to extract the common principles underlying the more complex re-identification techniques developed in recent years.
Thanks to Vitaly Shmatikov for reviewing an earlier draft of this post.
This is the first in a series of loosely related posts that will lead up to an announcement of new research results.
Asymptotic complexity is often very useful in succintly expressing how hard a problem is. Occasionally, though, it can obfuscate the picture. Graph isomorphism is a perfect example.
Many people mistakenly believe that graph isomorphism (GI) is hard — either NP-hard, or hard enough to be insoluble in practice for large problem sizes. Certainly the complexity of the best known algorithm — exp(O(sqrt(n log n))), due to Babai and Luks — would appear to support this belief. However, the truth is that GI belongs to its own complexity class, not known to be NP-hard nor known to be soluble in polynomial time. More importantly, on real inputs, graph isomorphism is ridiculously easy. So easy that real-life solvers can plow through hundreds of thousands of instances per second.
Why is this the case? It turns out that on random graphs, GI is solvable in a very straightforward way. You only need to look at the local structure of each node — due to the randomness, every node has a distinctive neighborhood. Thus, the complexity of the worst case input and the average case input are on opposite ends of a spectrum. To see where “real” graphs fall on this spectrum, note that generative models of real graphs have a regular part and a random part. The regular part is captured in rules such as preferential attachment, but such rules only induce a probability distribution; there is still quite a bit of coin tossing needed to generate each edge. Intuitively, if there is “enough” randomness in the neighborhood of each node, then you only need to look at the local structure to tell different nodes apart. Thus, real-world graphs behave pretty much like random graphs.
It gets better: one of the uses of nauty, the leading graph isomorphism solver, is to find hard instances of graph isomorphism. The way this is done is by generating hundreds of candidate hard instances, solving them, and picking the ones that are the hardest to solve. (These are usually highly specialized graphs which are strongly regular and have fearful symmetry groups.) It would appear, then, that finding hard inputs for GI is GI-hard! I wonder if this property can be formally defined, and if there are other known problems for which this is true. This is a similar notion to Impagliazzo’s Heuristica, a world where NP != P, but for any problem in NP, and for inputs drawn from any samplable distribution (i.e, for any “real-world” input), there exists a polynomial time algorithm to decide it.
While I find this an interesting theory problem, and would love to hear opinions on it from theorists, my reason for posting this is to point out that with all the current evidence, graph isomorphism can be assumed to be solvable in randomized polynomial time for any input that you would actually encounter in reality.