Posts tagged ‘de-anonymization’
What should we do about re-identification? Back when I started this blog in grad school seven years ago, I subtitled it “The end of anonymous data and what to do about it,” anticipating that I’d work on re-identification demonstrations as well as technical and policy solutions. As it turns out, I’ve looked at the former much more often than the latter. That said, my recent paper A Precautionary Approach to Big Data Privacy with Joanna Huey and Ed Felten tackles the “what to do about it” question head-on. We present a comprehensive set of recommendations for policy makers and practitioners.
One more re-identification demonstration, and then I’m out. Overall, I’ve moved on in terms of my research interests to other topics like web privacy and cryptocurrencies. That said, there’s one fairly significant re-identification demonstration I hope to do some time this year. This is something I started in grad school, obtained encouraging preliminary results on, and then put on the back burner. Stay tuned.
Machine learning and re-identification. I’ve argued that the algorithms used in re-identification turn up everywhere in computer science. I’m still interested in these algorithms from this broader perspective. My recent collaboration on de-anonymizing programmers using coding style is a good example. It uses more sophisticated machine learning than most of my earlier work on re-identification, and the potential impact is more in forensics than in privacy.
Privacy and ethical issues in big data. There’s a new set of thorny challenges in big data — privacy-violating inferences, fairness of machine learning, and ethics in general. I’m collaborating with technology ethics scholar Solon Barocas on these topics. Here’s an abstract we wrote recently, just to give you a flavor of what we’re doing:
How to do machine learning ethically
Every now and then, a story about inference goes viral. You may remember the one about Target advertising to customers who were determined to be pregnant based on their shopping patterns. The public reacts by showing deep discomfort about the power of inference and says it’s a violation of privacy. On the other hand, the company in question protests that there was no wrongdoing — after all, they had only collected innocuous information on customers’ purchases and hadn’t revealed that data to anyone else.
This common pattern reveals a deep disconnect between what people seem to care about when they cry privacy foul and the way the protection of privacy is currently operationalized. The idea that companies shouldn’t make inferences based on data they’ve legally and ethically collected might be disturbing and confusing to a data scientist.
And yet, we argue that doing machine learning ethically means accepting and adhering to boundaries on what’s OK to infer or predict about people, as well as how learning algorithms should be designed. We outline several categories of inference that run afoul of privacy norms. Finally, we explain why ethical considerations sometimes need to be built in at the algorithmic level, rather than being left to whoever is deploying the system. While we identify a number of technical challenges that we don’t quite know how to solve yet, we also provide some guidance that will help practitioners avoid these hazards.
Ed Felten and I recently wrote a response to a poorly reasoned defense of data anonymization. This doesn’t mean, however, that there’s never a place for anonymization. Here’s my personal view on some good and bad reasons for anonymizing data before sharing it.
Good: We’re using anonymization to keep honest people honest. We’re only providing the data to insiders (employees) or semi-insiders (research collaborators), and we want to help them resist the temptation to peep.
Probably good: We’re sharing data only with a limited set of partners. These partners have a reputation to protect; they have also signed legal agreements that specify acceptable uses, retention periods, and audits.
Possibly good: We de-identified the data at a big cost in utility — for example, by making high-dimensional data low-dimensional via “vertical partitioning” — but it still enables some useful data analysis. (There are significant unexplored research questions here, and technically sound privacy guarantees may be possible.)
Reasonable: The data needed to be released no matter what; techniques like differential privacy didn’t produce useful results on our dataset. We released de-identified data and decided to hope for the best.
Reasonable: The auxiliary data needed for de-anonymization doesn’t currently exist publicly and/or on a large scale. We’re acting on the assumption that it won’t materialize in a relevant time-frame and are willing to accept the risk that we’re wrong.
Ethically dubious: The privacy harm to individuals is outweighed by the greater good to society. Related: de-anonymization is not as bad as many other privacy risks that consumers face.
Sometimes plausible: The marginal benefit of de-anonymization (compared to simply using the auxiliary dataset for marketing or whatever purpose) is so low that even the small cost of skilled effort is a sufficient deterrent. Adversaries will prefer other means of acquiring equivalent data — through purchase, if they are lawful, or hacking, if they’re not.[*]
Bad: Since there aren’t many reports of de-anonymization except research demonstrations, it’s safe to assume it isn’t happening.
It’s surprising how often this argument is advanced considering that it’s a complete non-sequitur: malfeasors who de-anonymize are obviously not going to brag about it. The next argument is a self-interested version that takes this fact into account.
Dangerously rational: There won’t be a PR fallout from releasing anonymized data because researchers no longer have the incentive for de-anonymization demonstrations, whereas if malfeasors do it they won’t publicize it (elaborated here).
Bad: The expertise needed for de-anonymization is such a rare skill that it’s not a serious threat (addressed here).
Bad: We simulated some attacks and estimated that only 1% of records are at risk of being de-anonymized. (Completely unscientific; addressed here.)
Qualitative risk assessment is valuable; quantitative methods can be a useful heuristic to compare different choices of anonymization parameters if one has already decided to release anonymized data for other reasons, but can’t be used as a justification of the decision.
[*] This is my restatement of one of Yakowitz’s arguments in Tragedy of the Data Commons.
What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?
In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.
Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.
First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher that I know who is a U.S. citizen tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.
Homer et al’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.
It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.
It is not only people that are fingerprintable; so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.
Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.
Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.
If reidentification can cause leakage and make things messy, it can also clean things up. In a general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.
In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways). On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
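To make the idea of linking on a shared attribute set (rather than on a name) concrete, here is a minimal sketch that matches a paper's coauthor list against candidate profiles by Jaccard overlap. The function names, the threshold of 0.3, and the toy data are all assumptions made for illustration; this is not the code that was actually used for either project.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_record(query_attrs, candidates, threshold=0.3):
    """Return the candidate key whose attribute set best overlaps the query,
    or None if even the best overlap is below the (hypothetical) threshold."""
    best_key, best_score = None, 0.0
    for key, attrs in candidates.items():
        score = jaccard(query_attrs, attrs)
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None


# Toy data: two profiles share a name, so only the coauthor lists tell them apart.
profiles = {
    "J. Smith (profile 1)": {"A. Lee", "B. Chen", "C. Gupta"},
    "J. Smith (profile 2)": {"D. Novak", "E. Rossi"},
}
paper_coauthors = {"A. Lee", "C. Gupta", "F. Okafor"}
match = link_record(paper_coauthors, profiles)
```

Returning None below the threshold matters in practice: a wrong link is usually worse than a missing one.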
There is more, but I’ll stop here. The point is that these algorithms are everywhere.
If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.
Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.
My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?
Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the subjects in the datasets I study.
I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in the back of our minds as we continue the conversation on the ethics of reidentification.
Thanks to Vitaly Shmatikov for reviewing a draft.
I have a new paper appearing at IEEE S&P with Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song on Internet-scale authorship identification based on stylometry, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written word. To quote from my previous post speculating on the possibility of Internet-scale authorship identification:
Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
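A minimal sketch of a single marker, assuming a simple tokenizer and assuming (purely for illustration) that we reduce the marker to a yes/no question such as "does this author prefer 'since'?". A yes/no marker can carry at most 1 bit, and less when the split across authors is uneven, which is one way to read "typically under 1 bit". The function names are hypothetical.

```python
import math
import re


def marker_ratio(text, word_a="since", word_b="because"):
    """Fraction of word_a among all occurrences of word_a or word_b
    (None if neither word appears in the text)."""
    words = re.findall(r"[a-z']+", text.lower())
    a, b = words.count(word_a), words.count(word_b)
    return a / (a + b) if a + b else None


def binary_entropy(p):
    """Entropy in bits of a yes/no marker that is 'yes' for a fraction p of authors."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


sample = "Since it rained I stayed in, because I was tired, since I had walked far."
ratio = marker_ratio(sample)  # 'since' twice, 'because' once
```

A marker that splits authors evenly gives `binary_entropy(0.5)`, a full bit; one that 90% of authors answer the same way gives well under half a bit, which is why many markers have to be combined.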
The basic idea that people have distinctive writing styles is very well-known and well-understood, and there is an extremely long line of research on this topic. This research began in modern form in the early 1960s when statisticians Mosteller and Wallace determined the authorship of the disputed Federalist papers, and were featured in TIME magazine. It is never easy to make a significant contribution in a heavily studied area. No surprise, then, that my initial blog post was written about three years ago, and the Stanford-Berkeley collaboration began in earnest over two years ago.
Impact. So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first time that stylometry has been shown to have serious implications for online anonymity.
Anonymity and free speech have been intertwined throughout history. For example, anonymous discourse was essential to the debates that gave birth to the United States Constitution. Yet a right to anonymity is meaningless if an anonymous author’s identity can be unmasked by adversaries. While there have been many attempts to legally force service providers and other intermediaries to reveal the identity of anonymous users, courts have generally upheld the right to anonymity. But what if authors can be identified based on nothing but a comparison of the content they publish to other web content they have previously authored?
Experiments. Our experimental methodology is set up to directly address this question. Our primary data source was the ICWSM 2009 Spinn3r Blog Dataset, a large collection of blog posts made available to researchers by Spinn3r.com, a provider of blog-related commercial data feeds. To test the identifiability of an author, we remove k random posts (typically 3) from the corresponding blog, treat those posts as if they were anonymous, and apply our algorithm to try to determine which blog they came from. In these experiments, the labeled (identified) and unlabeled (anonymous) texts are drawn from the same context. We call this post-to-blog matching.
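The post-to-blog setup can be sketched as a simple hold-out split. This is an illustrative sketch only: the data layout (a dict from blog id to a list of posts), the function name, and the rule of skipping blogs with k or fewer posts are assumptions, not details taken from the paper.

```python
import random


def split_for_post_to_blog(blogs, k=3, seed=0):
    """Hold out k random posts per blog as 'anonymous' text; keep the rest as labeled text."""
    rng = random.Random(seed)
    labeled, anonymous = {}, {}
    for blog_id, posts in blogs.items():
        if len(posts) <= k:
            continue  # assumption: need at least one post left over as labeled text
        held_out = set(rng.sample(range(len(posts)), k))
        anonymous[blog_id] = [p for i, p in enumerate(posts) if i in held_out]
        labeled[blog_id] = [p for i, p in enumerate(posts) if i not in held_out]
    return labeled, anonymous


toy_blogs = {
    "blog_a": ["post %d" % i for i in range(10)],
    "blog_b": ["only", "two"],
}
labeled, anonymous = split_for_post_to_blog(toy_blogs)
```

The classifier is then trained on `labeled` and asked which blog each entry of `anonymous` came from.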
In some applications of stylometric authorship recognition, the context for the identified and anonymous text might be the same. This was the case in the famous study of the Federalist papers: each author hid his name from some of his papers, but wrote about the same topic. In the blogging scenario, an author might decide to selectively distribute a few particularly sensitive posts anonymously through a different channel. But in other cases, the unlabeled text might be political speech, whereas the only available labeled text by the same author might be a cooking blog, i.e., the labeled and unlabeled text might come from different contexts. Context encompasses much more than topic: the tone might be formal or informal; the author might be in a different mental state (e.g., more emotional) in one context versus the other, etc.
We feel that it is crucial for authorship recognition techniques to be validated in a cross-context setting. Previous work has fallen short in this regard because of the difficulty of finding a suitable dataset. We were able to obtain about 2,000 pairs (and a few triples, etc.) of blogs, each pair written by the same author, by looking at a dataset of 3.5 million Google profiles and searching for users who listed more than one blog in the ‘websites’ field. We are thankful to Daniele Perito for sharing this dataset. We added these blogs to the Spinn3r blog dataset to bring the total to 100,000. Using this data, we performed experiments as follows: remove one of a pair of blogs written by the same author, and use it as unlabeled text. The goal is to find the other blog written by the same author. We call this blog-to-blog matching. Note that although the number of blog pairs is only a few thousand, we match each anonymous blog against all 99,999 other blogs.
Results. Our baseline result is that in the post-to-blog experiments, the author was correctly identified 20% of the time. This means that when our algorithm uses three anonymously published blog posts to rank the possible authors in descending order of probability, the top guess is correct 20% of the time.
But it gets better from there. In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won’t be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation. A manual examination may incorporate several characteristics that the automated analysis does not, such as choice of topic (our algorithms are scrupulously “topic-free”). Location is another signal that can be used: for example, if we were trying to identify the author of the once-anonymous blog Washingtonienne we’d know that she almost certainly resides in or around Washington, D.C. Alternately, a powerful adversary such as law enforcement may require Blogger, WordPress, or another popular blog host to reveal the login times of the top suspects, which could be correlated with the timing of posts on the anonymous blog to confirm a match.
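The difference between "top guess is correct" and "correct author is on the shortlist" is just top-k accuracy over ranked guesses. A minimal sketch with made-up rankings (the function name and toy data are assumptions for illustration):

```python
def top_k_accuracy(rankings, true_authors, k):
    """Fraction of test cases whose true author appears among the first k ranked guesses."""
    hits = sum(1 for ranked, truth in zip(rankings, true_authors) if truth in ranked[:k])
    return hits / len(true_authors)


# Each inner list ranks candidate authors from most to least likely for one anonymous text.
toy_rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["b", "c", "a"],
]
toy_truth = ["a", "a", "a", "a"]
top1 = top_k_accuracy(toy_rankings, toy_truth, k=1)  # 0.25
top2 = top_k_accuracy(toy_rankings, toy_truth, k=2)  # 0.5
```

In the experiments above, the k=1 figure is 20% and the k=20 figure is 35%.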
We can also improve the accuracy significantly over the baseline of 20% for authors for whom we have more than an average number of labeled or unlabeled blog posts. For example, with 40–50 labeled posts to work with (the average is 20 posts per author), the accuracy goes up to 30–35%.
An important capability is confidence estimation, i.e., modifying the algorithm to also output a score reflecting its degree of confidence in the prediction. We measure the efficacy of confidence estimation via the standard machine-learning metrics of precision and recall. We find that we can improve precision from 20% to over 80% with only a halving of recall. In plain English, what these numbers mean is: the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time. Overall, it identifies 10% (half of 20%) of authors correctly, i.e., 10,000 out of the 100,000 authors in our dataset. Strong as these numbers are, it is important to keep in mind that in a real-life deanonymization attack on a specific target, it is likely that confidence can be greatly improved through methods discussed above — topic, manual inspection, etc.
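The arithmetic behind these numbers can be checked in a few lines. The count of 12,500 attempted identifications is not stated in the text; it is simply what exactly 80% precision and 10,000 correct answers would imply, so treat it as an illustrative assumption.

```python
def precision_recall(num_attempted, num_correct, num_total):
    """Precision = correct / attempted; recall = correct / all cases."""
    return num_correct / num_attempted, num_correct / num_total


total_authors = 100_000

# Baseline: always guess; 20% of guesses are right.
baseline = precision_recall(100_000, 20_000, total_authors)  # (0.2, 0.2)

# With a confidence threshold: guess only when confident.
# 12,500 attempts is an implied figure (10,000 / 0.8), not a number from the paper.
thresholded = precision_recall(12_500, 10_000, total_authors)  # (0.8, 0.1)
```

In other words, the algorithm stays silent on most authors and gives up half of its correct answers, but the answers it does give become four times as trustworthy.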
We confirmed that our techniques work in a cross-context setting (i.e., blog-to-blog experiments), although the accuracy is lower (~12%). Confidence estimation works really well in this setting as well and boosts accuracy to over 50% with a halving of recall. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in an appendix to the paper. For detailed graphs as well as a variety of other experimental results, see the paper.
We see our results as establishing early lower bounds on the efficacy of large-scale stylometric authorship recognition. Having cracked the scale barrier, we expect accuracy improvements to come easier in the future. In particular, we report experiments in the paper showing that a combination of two very different classifiers works better than either, but there is a lot more mileage to squeeze from this approach, given that ensembles of classifiers are known to work well for most machine-learning problems. Also, there is much work to be done in terms of analyzing which aspects of writing style are preserved across contexts, and using this understanding to improve accuracy in that setting.
Techniques. Now let’s look in more detail at the techniques I’ve hinted at above. The author identification task proceeds in two steps: feature extraction and classification. In the feature extraction stage, we reduce each blog post to a sequence of about 1,200 numerical features (a “feature vector”) that acts as a fingerprint. These features fall into various lexical and grammatical categories. Two example features: the frequency of uppercase words, the number of words that occur exactly once in the text. While we mostly used the same set of features that the authors of the Writeprints paper did, we also came up with a new set of features that involved analyzing the grammatical parse trees of sentences.
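As a rough sketch of what two of these features might look like in code: the tokenizer, the decision to count only all-caps words of more than one letter as "uppercase words", and the function name are assumptions made here for illustration, and the real feature set has about 1,200 entries rather than two.

```python
import re


def extract_features(text):
    """Compute two toy stylometric features from a piece of text."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return {"uppercase_word_freq": 0.0, "hapax_count": 0}
    # Assumption: an "uppercase word" is an all-caps word longer than one letter.
    uppercase = sum(1 for w in words if len(w) > 1 and w.isupper())
    counts = {}
    for w in words:
        counts[w.lower()] = counts.get(w.lower(), 0) + 1
    # Words that occur exactly once in the text.
    hapax = sum(1 for c in counts.values() if c == 1)
    return {"uppercase_word_freq": uppercase / len(words), "hapax_count": hapax}


features = extract_features("I told NASA that the launch, the BIG launch, was late.")
```

Here 2 of the 11 words are all-caps, and 7 distinct words occur exactly once ("the" and "launch" each occur twice).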
An important component of feature extraction is to ensure that our analysis is purely stylistic. We do this in two ways: first, we preprocess the blog posts to filter out signatures, markup, or anything that might not be directly entered by a human. Second, we restrict our features to those that bear little resemblance to the topic of discussion. In particular, our word-based features are limited to stylistic “function words” that we list in an appendix to the paper.
In the classification stage, we algorithmically “learn” a characterization of each author (from the set of feature vectors corresponding to the posts written by that author). Given a set of feature vectors from an unknown author, we use the learned characterizations to decide which author it most likely corresponds to. For example, viewing each feature vector as a point in a high-dimensional space, the learning algorithm might try to find a “hyperplane” that separates the points corresponding to one author from those of every other author, and the decision algorithm might determine, given a set of hyperplanes corresponding to each known author, which hyperplane best separates the unknown author from the rest.
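The hyperplane picture can be sketched with a plain perceptron trained one-vs-rest. To be clear, the perceptron is swapped in here only because it is the simplest hyperplane learner; it is not claimed to be the classifier used in the paper, and the two-dimensional toy vectors and function names are assumptions for illustration.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_one_vs_rest(vectors_by_author, epochs=50):
    """Learn one (weights, bias) hyperplane per author with a plain perceptron."""
    dim = len(next(iter(vectors_by_author.values()))[0])
    models = {}
    for author in vectors_by_author:
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for other, vectors in vectors_by_author.items():
                target = 1 if other == author else -1
                for x in vectors:
                    if target * (dot(w, x) + b) <= 0:  # misclassified: nudge the hyperplane
                        w = [wi + target * xi for wi, xi in zip(w, x)]
                        b += target
        models[author] = (w, b)
    return models


def identify(models, unknown_vectors):
    """Pick the author whose hyperplane gives the unknown posts the highest mean score."""
    def mean_score(author):
        w, b = models[author]
        return sum(dot(w, x) + b for x in unknown_vectors) / len(unknown_vectors)
    return max(models, key=mean_score)


training = {
    "alice": [[2.0, 0.0], [3.0, 1.0]],
    "bob": [[0.0, 2.0], [1.0, 3.0]],
}
models = train_one_vs_rest(training)
guess = identify(models, [[2.5, 0.5], [3.0, 0.0]])
```

Averaging the score over all of the unknown author's posts is one simple way to combine several feature vectors into a single decision; other combination rules are possible.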
We made several innovations that allowed us to achieve the accuracy levels that we did. First, contrary to some previous authors who hypothesized that only relatively straightforward “lazy” classifiers work for this type of problem, we were able to avoid various pitfalls and use more high-powered machinery. Second, we developed new techniques for confidence estimation, including a measure very similar to “eccentricity” used in the Netflix paper. Third, we developed techniques to improve the performance (speed) of our classifiers, detailed in the paper. This is a research contribution by itself, but it also enabled us to rapidly iterate the development of our algorithms and optimize them.
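To give a flavor of the second point, here is a sketch of an eccentricity-style confidence score: the gap between the best and second-best candidate scores, measured in units of the standard deviation of all candidate scores. This follows my recollection of the Netflix-paper definition and is only "very similar" to what the stylometry paper uses; the use of the population standard deviation, the threshold of 1.5, and the function names are assumptions.

```python
import statistics


def eccentricity(scores):
    """Gap between the top two scores divided by the spread of all scores."""
    ranked = sorted(scores, reverse=True)
    spread = statistics.pstdev(scores)
    if spread == 0:
        return 0.0
    return (ranked[0] - ranked[1]) / spread


def guess_or_abstain(scores_by_author, threshold=1.5):
    """Return the top-scoring author only if it clearly stands out; otherwise None."""
    if eccentricity(list(scores_by_author.values())) < threshold:
        return None
    return max(scores_by_author, key=scores_by_author.get)


clear_winner = guess_or_abstain({"a": 10, "b": 1, "c": 0, "d": 1})
too_close = guess_or_abstain({"a": 5, "b": 4.9, "c": 0, "d": 0.1})
```

Abstaining on the second example is what trades recall for precision: the top two candidates are nearly tied, so any guess would be close to a coin flip.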
In an earlier article, I noted that we don’t yet have as rigorous an understanding of deanonymization algorithms as we would like. I see this paper as a significant step in that direction. In my series on fingerprinting, I pointed out that in numerous domains, researchers have considered classification/deanonymization problems with tens of classes, with implications for forensics and security-enhancing applications, but that to explore the privacy-infringing/surveillance applications the methods need to be tweaked to be able to deal with a much larger number of classes. Our work shows how to do that, and we believe that insights from our paper will be generally applicable to numerous problems in the privacy space.
Concluding thoughts. We’ve thrown open the doors for the study of writing-style based deanonymization that can be carried out on an Internet-wide scale, and our research demonstrates that the threat is already real. We believe that our techniques are valuable by themselves as well.
The good news for authors who would like to protect themselves against deanonymization is that manually changing one’s style appears to be enough to throw off these attacks. Developing fully automated methods to hide traces of one’s writing style remains a challenge. For now, few people are aware of the existence of these attacks and defenses; all the sensitive text that has already been anonymously written is also at risk of deanonymization.
 A team from Israel have studied authorship recognition with 10,000 authors. While this is interesting and impressive work, and bears some similarities with ours, they do not restrict themselves to stylistic analysis, and therefore the method is comparatively limited in scope. Incidentally, they have been in the news recently for some related work.
 Although the fraction of users who listed even a single blog in their Google profile was small, there were more than 2,000 users who listed multiple. We did not use the full number that was available.
The previous article looked at how pieces of blank paper can be uniquely identified. This article continues the fingerprinting theme to another domain, digital cameras, and ends by speculating on the possibility of applying the technique on an Internet-wide scale.
For various kinds of devices like digital cameras and RFID chips, even supposedly identical units that come out of a manufacturing plant behave slightly differently in characteristic ways, and can therefore be distinguished based on their output or behavior. How could this be? The unifying principle is this:
Digital camera identification belongs to a class of techniques that exploits ‘pattern noise’ in the ‘sensor arrays’ that capture images. The same techniques can be used to fingerprint a scanner by analyzing pixel-level patterns in the images scanned by it, but that’ll be the focus of a later article.
A long-exposure dark frame. Three ‘hot pixels’ and some other sensor noise can be seen.
A photo taken in the absence of any light doesn’t look completely black; a variety of factors introduce noise. There is random noise that varies in every image, but there is also ‘pattern noise’ due to inherent structural defects or irregularities in the physical sensor array. The key property of the latter kind of noise is that it manifests the same way in every image taken by the camera. Thus, the total noise vector produced by a camera is not identical between images, nor is it completely independent.
Nevertheless, separating the pattern noise from random noise and the image itself — after all, a good camera will seek to minimize the strength or ‘power’ of the noise in relation to the image — is a very difficult task, and is the primary technical challenge that camera fingerprinting techniques must address.
Security vs. privacy. A quick note about the applications of camera fingerprinting. We saw in the previous article that there are security-enhancing and privacy-infringing applications of document fingerprinting. In fact, this is almost always the case with fingerprinting techniques. 
Camera fingerprinting can be used on the one hand for detecting forgeries (e.g., photoshopped images), and to aid criminal investigations by determining who (or rather, which camera) might have taken a picture. On the other hand, it could potentially also be used for unmasking individuals who wish to disseminate photos anonymously online.
Sadly, most papers studying fingerprinting study only the former type of application, which is why we’ll have to speculate a bit on the privacy impact, even though the underlying math of fingerprinting is the same.
Another point to note is that because of the focus on forensics, most of the work in this area so far has studied distinguishing different camera models. But there are some preliminary results on distinguishing ‘identical’ cameras, and it appears that the same techniques will work.
In more detail. Let’s look at what I think is the most well-known paper on sensor pattern noise fingerprinting, by Binghamton University researchers Jan Lukáš, Jessica Fridrich, and Miroslav Goljan. Here’s how it works: the first step is to build a reference pattern of a camera from multiple known images taken with it, so that later an unsourced image can be compared against these reference patterns. The authors suggest using at least 50, but for good measure, they use 320 in their experiments. In the forensics context, the investigator probably has physical possession of the camera and therefore can generate an unlimited number of images. We’ll discuss what this requirement means in the privacy-breach context later.
There are two steps to build the reference pattern. First, for each image, a denoising filter is applied, and the denoised image is subtracted from the original to leave only the noise. Next, the noise is averaged across all the reference images — this way the random noise cancels out and leaves the pattern noise.
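The two steps can be sketched on tiny grayscale images stored as lists of lists. The 3x3 mean filter below is a crude stand-in chosen for brevity, not the denoising filter the authors actually use, and the function names are hypothetical.

```python
def box_blur(image):
    """Crude 3x3 mean filter standing in for a real denoising filter."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neighbors = [image[j][i]
                         for j in range(max(0, y - 1), min(h, y + 2))
                         for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(neighbors) / len(neighbors)
    return out


def noise_residual(image):
    """Step 1: subtract the denoised image from the original, leaving only noise."""
    smooth = box_blur(image)
    return [[p - s for p, s in zip(row, srow)] for row, srow in zip(image, smooth)]


def reference_pattern(images):
    """Step 2: average the residuals so random noise cancels and pattern noise remains."""
    residuals = [noise_residual(img) for img in images]
    h, w = len(images[0]), len(images[0][0])
    return [[sum(r[y][x] for r in residuals) / len(residuals) for x in range(w)]
            for y in range(h)]


# A flat gray frame with one 'hot pixel' in the middle, seen in two shots.
hot = [[10, 10, 10], [10, 19, 10], [10, 10, 10]]
pattern = reference_pattern([hot, hot])
```

The hot pixel survives the averaging as a strong positive residual at the center, which is exactly the kind of stable per-sensor quirk the reference pattern is meant to capture.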
Comparing a new image to a reference pattern, to test if it came from that camera, is easy: extract the noise from the test image, and compare this noise pixel-by-pixel with the reference noise. The noise from the test image includes random noise, so the match won’t be close to perfect, but nevertheless the correlation between the two noise patterns will be roughly equal to the contribution of pattern noise towards the total noise in the test image. On the other hand, if the test image didn’t come from the same camera, the correlation will be close to zero.
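A small simulation makes the comparison step concrete. Noise patterns are represented as flat lists of per-pixel values; the Gaussian noise model, the equal strength of pattern and random noise, the decision threshold of 0.5, and the function names are all assumptions made for illustration rather than values from the paper.

```python
import math
import random


def correlation(a, b):
    """Pearson correlation between two equal-length lists of per-pixel noise values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def same_camera(test_noise, reference, threshold=0.5):
    """Hypothetical decision rule: accept if the correlation clears a fixed threshold."""
    return correlation(test_noise, reference) >= threshold


rng = random.Random(42)
num_pixels = 2000
reference = [rng.gauss(0, 1) for _ in range(num_pixels)]        # camera A's pattern noise
from_a = [p + rng.gauss(0, 1) for p in reference]               # pattern plus random noise
from_b = [rng.gauss(0, 1) + rng.gauss(0, 1) for _ in range(num_pixels)]  # unrelated camera

corr_same = correlation(from_a, reference)
corr_other = correlation(from_b, reference)
```

With pattern and random noise of equal strength, `corr_same` comes out well above zero while `corr_other` hovers near zero, which is the separation the method relies on. A real system would pick the threshold from the desired false accept rate rather than hard-coding it.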
The authors experimented with nine cameras, of which two were from the same brand and model (Olympus Camedia C765). In addition, two other cameras had the same type of sensor. There was not a single error in their 2,700 tests, including those involving the two ‘identical’ cameras: in each case, the algorithm correctly identified which of the nine cameras a given image came from. By extrapolating the correlation curves, they conservatively estimate that for a False Accept Rate of 10⁻³, their method achieves a False Reject Rate of anywhere between 10⁻² and 10⁻¹⁰ or even less, depending on the camera model and camera settings.
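For readers unfamiliar with how a False Accept Rate is traded against a False Reject Rate, here is a rough sketch. It assumes both the matching and non-matching correlations are normally distributed, which is my simplification and not necessarily the model the authors fit; the two lists of correlation values are made up purely for illustration and are not data from the paper.

```python
from statistics import NormalDist

def threshold_for_far(nonmatch_corrs, far):
    """Correlation threshold that a non-matching camera exceeds with probability `far`
    (Gaussian fit to the non-matching correlations; an assumed model)."""
    return NormalDist.from_samples(nonmatch_corrs).inv_cdf(1.0 - far)

def frr_at(match_corrs, threshold):
    """Probability that a genuine match falls below the threshold (same assumed model)."""
    return NormalDist.from_samples(match_corrs).cdf(threshold)

# Hypothetical correlation values, invented for this example.
nonmatch = [-0.02, -0.01, 0.0, 0.01, 0.02]   # images from other cameras
match = [0.05, 0.10, 0.15, 0.20, 0.25]       # images from the same camera

threshold = threshold_for_far(nonmatch, far=1e-3)
frr = frr_at(match, threshold)
```

Fixing the False Accept Rate pins down the threshold; the False Reject Rate then depends entirely on how far the matching correlations sit above it, which is why the estimate varies so much across camera models and settings.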
The takeaway from this seems to be that distinguishing between cameras of different models can be performed with essentially perfect accuracy. Distinguishing between cameras of the same model also seems to have very high accuracy, but it is hard to generalize because of the small sample size.
Improvements. Impressive as the above numbers are, there are at least two major ways in which this result can be, and has been, improved. First, the Binghamton paper is focused on a specific signal, sensor noise. But there are several stages in the image acquisition and processing pipeline in the camera, each of which could leave idiosyncratic effects on the image. This paper out of Turkey incorporates many such effects by considering all patterns of certain types that occur in the lower order (least significant) bits of the image, which seems like a rather powerful technique.
The effects other than sensor noise seem to help more with identifying the camera model than the specific device, but to the extent that the former is a component of the latter, it is useful. They achieve a 97.5% accuracy among 16 test cameras — but with cellphone cameras with pictures at a resolution of just 640×480.
Second is the effect of the scene itself on the noise. Denoising transformations are not perfect — sharp boundaries look like noise. The Binghamton researchers picked their denoising filter (a wavelet transform) to minimize this problem, but a recent paper by Chang-Tsun Li claims to do it better, and shows even better numerical results: with 6 cameras (all different models), accurate (over 99%) identification for image fragments cropped to just 256 x 512.
What does this mean for privacy? I said earlier that there is a duality between security and privacy, but let’s examine the relationship in more detail. In privacy-infringing applications like mass surveillance, the algorithm need not always produce an answer, and it can occasionally be wrong when it does. The penalty for errors is much lower. On the other hand, the matching algorithm in surveillance-like applications needs to handle a far larger number of candidate cameras. The key point is:
My intuition is that state-of-the-art techniques, configured slightly differently, should allow probabilistic deanonymization from among tens of thousands of different cameras. A Flickr or Picasa profile with a few dozen images should suffice to fingerprint a camera. Combined with metadata such as location, this puts us within striking distance of Internet-scale source-camera identification from anonymous images. I really hope there will be some serious research on this question.
Finally, a word on defenses. If you find yourself in a position where you wish to anonymously publicize a sensitive photograph you took, but your camera is publicly tied to your identity because you’ve previously shared pictures on social networks (and who hasn’t), how do you protect yourself?
Compressing the image is one possibility, because that destroys the ‘lower-order’ bits that fingerprinting crucially depends on. However, it would have to be way more aggressive than most camera defaults (JPEG quality factor ~60% according to one of the studies, whereas defaults are ~95%). A different strategy is rotating the image slightly in order to ‘desynchronize’ it, throwing off the fingerprint matching. An attack that defeats this will have to be much more sophisticated and will have a far higher error rate.
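To see why desynchronization hurts a pixel-by-pixel comparison, consider the toy sketch below. It uses a one-pixel cyclic shift as a much simpler stand-in for a slight rotation, and operates on simulated noise rather than real photos; the image size, noise levels, and names are all assumptions for illustration.

```python
import math
import random

def pearson(a, b):
    """Pixel-by-pixel (Pearson) correlation between two equal-length noise vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def flatten(img):
    return [v for row in img for v in row]

def shift_right(img, k=1):
    """Cyclically shift every row by k pixels (crude stand-in for a slight rotation)."""
    return [row[-k:] + row[:-k] for row in img]

rng = random.Random(2)
H, W = 64, 64
# Hypothetical camera pattern and the noise extracted from one of its photos.
pattern = [[rng.gauss(0, 1.0) for _ in range(W)] for _ in range(H)]
photo_noise = [[p + rng.gauss(0, 3.0) for p in row] for row in pattern]

aligned = pearson(flatten(photo_noise), flatten(pattern))
desynced = pearson(flatten(shift_right(photo_noise)), flatten(pattern))
```

Even a one-pixel misalignment should drive the correlation to roughly zero, because each pixel’s pattern noise is independent of its neighbor’s in this model. A pure shift is easy for an attacker to undo by trying a handful of offsets, so this only illustrates the fragility of the naive comparison; a real rotation leaves a much larger space to search.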
The deanonymization threat here is analogous to writing-style fingerprinting: there are simple defenses, albeit not foolproof, but sadly most users are unaware of the problem, let alone solutions.
 That was a bit simplified; mathematically, there is an additive component (dark signal nonuniformity) and a multiplicative component (photoresponse nonuniformity). The former is easy to correct for, and higher-end cameras do, but the latter isn’t.
 Much has been said about the tension between security and privacy at a social/legal/political level, but I’m making a relatively uncontroversial technical statement here.
 Fridrich is incidentally one of the pioneers of speedcubing i.e., speed-solving the Rubik’s cube.
 The Binghamton paper uses 320 images per camera for building a fingerprint (and recommends at least 50); the Turkey paper uses 100, and Li’s paper 50. I suspect that if more than one image taken from the unknown camera is available, then the number of reference images can be brought down by a corresponding factor.