This is a map of essentially everything I’ve written here on 33bits.org (up until April 2011), arranged into coherent threads. It should help you take a high-level look and quickly zoom in on what you’re interested in. The sections are arranged roughly in the order of increasing technical depth, which means that the core topic of deanonymization is actually at the end.
See also: About 33 bits
The policy question most pertinent to this blog is what to do about the fact that anonymization as a privacy-protection paradigm is broken, a problem with no purely technical solutions. Myths and Fallacies of “Personally Identifiable Information” summarizes an ACM Viewpoints column in which Shmatikov and I argue that the over-reliance on “PII” in regulation is harmful. Netflix, incidentally, seems to have illustrated what not to do: vehemently deny for years that there are problems with anonymization, and then scrap their second contest altogether when they came under scrutiny. Shmatikov and I wrote an open letter calling them out on this.
To regulate or not to regulate? A very common question, and I’ve frequently argued both sides of it in different contexts. In Privacy and the Market for Lemons, or How Websites Are Like Used Cars, I explain how information asymmetry has left consumers unable to make meaningful privacy decisions online, and why the market won’t fix itself without a regulator. In The Unsung Success of CAN-SPAM I defend a frequent whipping boy of anti-regulationists and show that the Act has been far more successful than generally thought.
On the other hand, regulatory approaches are often proposed that don’t make sense in the light of technological reality. In The Internet has no Delete Button: Limits of the Legal System in Protecting Anonymity, I argue against the “right to be forgotten.” Equally, I believe that regulation in the absence of clear evidence of market failure is likely to hurt, which is why I disagree with Tim Wu’s central claim in his book The Master Switch.
A couple of expository articles intended to inform regulation: Do Not Track Explained covers many of the technological, business and policy issues around the emerging Do Not Track standard. In The Secret Life of Data, I argue that not being able to track data flows between companies is not a reason to imagine that no harm occurs from reidentification and data breaches.
Occasionally I analyze specific pieces of proposed legislation or regulation. When there was a widespread privacy outcry around an Oklahoma abortion reporting law, I showed that the deanonymization concerns were based on a misreading of the law’s text. In another instance, Jonathan Mayer and I analyzed a draft version of the National Strategy for Trusted Identities in Cyberspace (NSTIC).
Turning to ethical issues, I’m fascinated by the dark side of anonymity—how anonymous speech online completely strips us of thousands of years of evolved social norms and civility. Focusing on the problem of rampant sexism, I argue that most websites should disable unmoderated anonymous comments. Another ethical issue is about whether research on deanonymization is a net benefit or a detriment to society. You can probably guess what my answer is.
A hard-to-classify post that I am fond of: In which I come out: Notes from the FTC Privacy Roundtable. After participating in the roundtable in early 2010, I decided to get more actively involved in policy and developed ambitions of punditry, whereas earlier I stuck to the math and avoided anything controversial. In this post I announced my new intentions and shared my notes from the event. In a similar vein, I wrote up my first impressions of the Washington D.C. policy world after being invited to participate in a panel, and argued that more academics need to get involved in policy.
I believe that computer science has gotten online privacy seriously wrong, and in order for privacy protection to be effective, developers need to import and synthesize ideas from diverse fields such as law, psychology, economics and human-computer interaction. Several of my posts have explored facets of this theme.
The fact that making public data “more public” can be a privacy violation — often serious enough to shut a product or even a company down — is the best sign that something is wrong with the computer science model of data privacy, which does not recognize “degrees” of publicness. In What Every Developer Needs to Know About “Public” Data and Privacy, I provide a taxonomy of what “more public” means. Web Crawlers and Privacy: The Need to Reboot Robots.txt, coauthored with Pete Warden, looks at one specific problem that arises from this lack of granularity in specifying privacy controls.
Computer science focuses too much on access control. The root cause is “adversarial thinking”, an analytical paradigm that views users as all-powerful “adversaries” (limited only by the fundamental computational limits of nature), typically interested in learning as much information about everyone as possible. This works great in the realms of cryptography and computer security but breaks down when it comes to data privacy.
Privacy is generally not something that directly affects the bottom line; usually the reason companies care about it is PR. Watching privacy pitfalls affect a company’s reputation can be morbidly fascinating. Facebook, Privacy, Public Opinion and Pitchforks was written during Facebook’s 2010 “summer of discontent” when things got so bad the company lost control of the narrative and the stories started writing themselves.
More in the vein of privacy blunders resulting in negative public opinion, it’s hard to forget the Google Buzz trainwreck. On the other hand, when a company does privacy right, it is equally if not more important to study it and learn from it. I argue that Livejournal has gotten privacy largely right and I describe how they did it.
I think a lot about the role of cryptography in privacy protection. In my paper on Location Privacy via Private Proximity Testing with Thiagarajan, Lakhani, Hamburg and Boneh, we tried some new things—rather than just prove mathematical theorems about cryptographic techniques, we produced a working Android implementation and attempted to get tech companies to adopt it, which gave me insights into what the real-world stumbling blocks are. We also came up with a key-distribution mechanism that discards the assumptions behind traditional Public-Key Infrastructure and instead leverages social networks, which we believe is much better suited to today’s world. This is still very much a work in progress.
Finally, an article on how the dominant paradigm in the mathematical/statistical study of data privacy went from “utility-first” to “privacy-first” and how the “differential privacy” notion won over its opponents. I find that an understanding of the social processes behind research and research communities can be very helpful in understanding the subject matter itself.
In the beginning, web browsing was truly anonymous. The cookie and a variety of other technologies made it possible to track users, i.e., to deduce that the same person visited certain sites at certain times. However, the sites doing the tracking don’t know who you are, i.e., your name, etc., unless you choose to tell them in some way, such as by logging in.
That is now changing. We are quickly moving to a web where websites actually know your identity as you browse around. Cookies, Supercookies and Ubercookies looks at the basics and explains a recent paper that proposed a (complicated) attack for websites to learn your identity based on the history stealing bug. In Ubercookies Part 2: History Stealing Meets the Social Web, I show a much simpler version of this attack.
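The core idea behind these attacks can be illustrated with a toy sketch: history stealing tells a malicious page which (public) social-network group pages a visitor has viewed, and intersecting the groups’ public member lists shrinks the set of candidate identities, often down to one. All names, groups, and membership lists below are invented for illustration; this is not the actual attack code.

```python
# Toy sketch of the group-intersection idea: history stealing reveals
# which public group pages a visitor has viewed; intersecting the
# groups' member lists narrows down who the visitor is.
# All data here is made up.

# Public membership lists (an attacker can crawl these in advance).
group_members = {
    "hiking-club":  {"alice", "bob", "carol", "dave"},
    "go-players":   {"alice", "carol", "erin"},
    "film-society": {"alice", "dave", "frank"},
}

def candidates(leaked_groups):
    """Intersect the member lists of the groups the browser leaked."""
    sets = [group_members[g] for g in leaked_groups]
    result = sets[0].copy()
    for s in sets[1:]:
        result &= s
    return result

# Suppose history stealing shows the visitor viewed all three group pages.
print(candidates(["hiking-club", "go-players", "film-society"]))  # {'alice'}
```

Each leaked group roughly halves the candidate pool, so even a handful of group memberships can uniquely identify a visitor.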
Since then, the pace has picked up. In Feb 2010 I revealed a bug in Google spreadsheets that can be exploited by an arbitrary website to learn your identity. Google fixed the bug, but new ones kept coming, such as this one that exploited a bug in the Firefox error object.
Facebook’s Instant Personalization contributed to the problem in multiple ways. First, the partner websites’ implementations were full of security holes, which could again be exploited by an arbitrary website that you visit to — you guessed it — learn your identity. Second, by being so pervasive, it inures users to the idea that their browser is telling websites who they are and that there’s nothing they can do about it. In other words, creeping normalcy.
Is this loss of anonymity/pseudonymity just something that makes people uncomfortable, or can things go seriously wrong? One way they can is this: black hats operating websites can now use knowledge of the visitor/victim’s identity to make their social-engineering scams devastatingly effective. See One Click Frauds and Identity Leakage: Two Trends on a Collision Course. That said, the grey-hat threat is possibly more worrisome than black-hat—any number of companies, large and small, under pressure to improve their bottom line would jump at the chance to “customize the user experience” or “deliver better targeting” by deanonymizing their visitors through bugs such as history stealing.
The next frontier in the loss of anonymity is probably going to be smartphones. Combined with the knowledge of the user’s location, it’s a surveillance dream-come-true. I haven’t yet written about this but hope to in the near future.
The bread-and-butter topic of this blog is the technical exposition of de-anonymization. I have mostly reported on my own work, but occasionally others’.
My Netflix deanonymization paper with Shmatikov from 2006 showed how poorly “high-dimensional” data resists deanonymization. High-dimensional data is rich data such as longitudinal records of movie ratings—the average subscriber has over 200 ratings in the Netflix dataset. The paper paved the way for a wave of deanonymization results over the next few years.
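The flavor of the matching can be conveyed with a simplified sketch: score the attacker’s auxiliary knowledge (a few ratings known about the target) against each anonymized record, weighting rarely rated movies more heavily, since they are far more identifying. The real algorithm also handles dates and noise; the weighting scheme and all data below are invented for illustration.

```python
import math

# Simplified sketch of similarity scoring in the spirit of high-dimensional
# deanonymization: rare attributes carry more identifying weight.
# All records, movies, and ratings here are made up.

records = {  # anonymized subscriber id -> {movie: rating}
    "u1": {"m1": 5, "m2": 3, "m9": 1},
    "u2": {"m1": 4, "m3": 2, "m7": 5},
    "u3": {"m2": 3, "m9": 1, "m4": 4},
}

# Popularity of each movie = how many records rate it.
popularity = {}
for rec in records.values():
    for m in rec:
        popularity[m] = popularity.get(m, 0) + 1

def weight(movie):
    """Rarely rated movies get higher weight."""
    return 1.0 / math.log(1 + popularity[movie])

def score(aux, record):
    """Sum of weights over movies where the ratings approximately agree."""
    return sum(weight(m) for m, rating in aux.items()
               if m in record and abs(record[m] - rating) <= 1)

# Attacker's noisy partial knowledge about the target.
aux = {"m2": 2, "m9": 1, "m4": 5}
best = max(records, key=lambda u: score(aux, records[u]))
print(best)
```

Because each subscriber rates a couple of hundred movies out of tens of thousands, even a handful of approximately known ratings tends to single out one record.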
While the Netflix paper established the basic techniques, much of the subsequent work has involved studying different types of anonymized data that need to be sliced in different ways to make them yield. Perhaps the most interesting domain is graphs of social networks. Again with Shmatikov, I showed in 2008 how to take two graphs representing social networks and map the nodes to each other based on the graph structure alone—no usernames, no nothing. In 2011, teaming up with Shi and Rubinstein, I used the techniques in that paper along with some new ones to win the Kaggle social network challenge.
A domain that has been in the news lately is location data. In Your Morning Commute is Unique: On the Anonymity of Home/Work Location Pairs, I explain the deanonymization result of Golle and Partridge showing how most people are uniquely identified by their approximate home and work locations. I have to agree with Golle’s assessment that there is no domain of data so utterly hopeless to try to anonymize as location data, both due to the high entropy of location coordinates and the easy availability of auxiliary data to use for deanonymization.
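The underlying observation is easy to demonstrate on toy data: treat each worker as a (home block, work block) pair and count how many people share each pair. The population and locations below are invented; in the actual study the counts come from census data at various granularities.

```python
from collections import Counter

# Toy illustration of the home/work uniqueness observation: even coarse
# location pairs have tiny anonymity sets. All data here is invented.

workers = [  # (home_block, work_block) per person
    ("H1", "W1"), ("H1", "W1"), ("H1", "W2"),
    ("H2", "W3"), ("H2", "W1"), ("H3", "W3"),
]

pair_counts = Counter(workers)

# Fraction of people whose home/work pair is unique in the population.
unique = sum(1 for w in workers if pair_counts[w] == 1)
print(unique / len(workers))  # 4 of the 6 people have a unique pair
```

Because the two locations are nearly independent, the pair carries roughly the sum of their entropies, which is why anonymity sets collapse so quickly even at coarse granularity.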
Deanonymizing the Internet looks at the possibility of identifying the author of a piece of anonymously posted text online by matching the writing style with something they may have written elsewhere on the Internet. This is well-known if the number of possible candidate authors is small, but Internet-scale is a different beast altogether. Stay tuned for research results on this question.
Perito and others developed a mathematical model for linking people together based on their usernames alone. In a related vein, I used screen names and other techniques to show that the majority of Lending Club loan applicants represented in the company’s anonymized data release are vulnerable to deanonymization.
Genetic anonymity is a recent research area of mine. I have a detailed analysis of how much entropy there is in a DNA profile stored in law-enforcement databases, and whether this is enough to uniquely identify a person.
In lighthearted-yet-serious articles, I analyze the futility of trying to keep an entire institution anonymous when releasing data about its members, and the privacy dangers of providing a urine sample, motivated by the manner in which A-Rod got caught for steroids.
Deanonymization is related to many well-studied problems from other areas of information science such as record linkage, classification in machine learning, and biometric identification. Noticing such connections is often very illuminating but I argue that in the case of reidentification, these connections have done more harm than good. Therefore we need to develop a “science of reidentification” from the ground up.
Finally, a couple of technical tidbits: 1. Eccentricity is a way for a deanonymization algorithm to test its own output. We developed it as part of the Netflix paper but it has proved useful in a variety of problems over the years. 2. Social network deanonymization resembles the graph isomorphism problem, whose complexity, I argue, is horribly and widely misunderstood.
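The eccentricity idea can be sketched in a few lines: measure how far the best match’s score stands out from the runner-up, in units of the standard deviation of all scores. A large value means the match is statistically meaningful; a small one means the “best match” is probably noise. The formula below follows the gap-over-standard-deviation idea described above; the exact form used in the paper may differ, and the example scores are invented.

```python
import statistics

# Sketch of the eccentricity heuristic: a deanonymization algorithm can
# test its own output by checking how much the top score stands out.

def eccentricity(scores):
    """(best - second best) / standard deviation of all scores."""
    ranked = sorted(scores, reverse=True)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        return 0.0  # all candidates scored identically: no match
    return (ranked[0] - ranked[1]) / sigma

print(eccentricity([9.0, 2.0, 1.5, 1.0]))  # clear standout: large value
print(eccentricity([2.1, 2.0, 1.9, 1.8]))  # weak standout: much smaller
```

The attraction of this self-test is that it needs no ground truth: the algorithm declares a match only when the score distribution itself says the winner is an outlier.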