Posts tagged ‘data privacy’
Ed Felten and I recently wrote a response to a poorly reasoned defense of data anonymization. This doesn’t mean, however, that there’s never a place for anonymization. Here’s my personal view on some good and bad reasons for anonymizing data before sharing it.
Good: We’re using anonymization to keep honest people honest. We’re only providing the data to insiders (employees) or semi-insiders (research collaborators), and we want to help them resist the temptation to peep.
Probably good: We’re sharing data only with a limited set of partners. These partners have a reputation to protect; they have also signed legal agreements that specify acceptable uses, retention periods, and audits.
Possibly good: We de-identified the data at a big cost in utility — for example, by making high-dimensional data low-dimensional via “vertical partitioning” — but it still enables some useful data analysis. (There are significant unexplored research questions here, and technically sound privacy guarantees may be possible.)
Reasonable: The data needed to be released no matter what; techniques like differential privacy didn’t produce useful results on our dataset. We released de-identified data and decided to hope for the best.
Reasonable: The auxiliary data needed for de-anonymization doesn’t currently exist publicly and/or on a large scale. We’re acting on the assumption that it won’t materialize in a relevant time-frame and are willing to accept the risk that we’re wrong.
Ethically dubious: The privacy harm to individuals is outweighed by the greater good to society. Related: de-anonymization is not as bad as many other privacy risks that consumers face.
Sometimes plausible: The marginal benefit of de-anonymization (compared to simply using the auxiliary dataset for marketing or whatever purpose) is so low that even the small cost of skilled effort is a sufficient deterrent. Adversaries will prefer other means of acquiring equivalent data — through purchase, if they are lawful, or hacking, if they’re not.[*]
Bad: Since there aren’t many reports of de-anonymization except research demonstrations, it’s safe to assume it isn’t happening.
It’s surprising how often this argument is advanced considering that it’s a complete non-sequitur: malfeasors who de-anonymize are obviously not going to brag about it. The next argument is a self-interested version takes this fact into account.
Dangerously rational: There won’t be a PR fallout from releasing anonymized data because researchers no longer have the incentive for de-anonymization demonstrations, whereas if malfeasors do it they won’t publicize it (elaborated here).
Bad: The expertise needed for de-anonymization is such a rare skill that it’s not a serious threat (addressed here).
Bad: We simulated some attacks and estimated that only 1% of records are at risk of being de-anonymized. (Completely unscientific; addressed here.)
Qualitative risk assessment is valuable; quantitative methods can be a useful heuristic to compare different choices of anonymization parameters if one has already decided to release anonymized data for other reasons, but can’t be used as a justification of the decision.
[*] This is my restatement of one of Yakowitz’s arguments in Tragedy of the Data Commons.
Let’s take a break from the Ubercookies series. I’m at the IPAM data privacy workshop in LA, and I want to tell you about the kind of unusual scientific endeavor that it represents. I’ve recently started to write about the process of doing science, what’s good and what’s bad about it, and I expect to have more to say on this topic in this blog.
While “paradigm shift” has become a buzzword, the original sense in which Kuhn used it refers to a specific scientific process. I’ve had the rare experience of witnessing such a paradigm shift unfold, and I may even have played a small part. I am going to tell that story. I hope it will give you a “behind-the-scenes” look into how science works.
I will sidestep the question of whether data privacy is a science. I think it is a science to the extent that computer science is a science. At any rate, I think this narrative provides a nice illustration of Kuhn’s ideas.
First I need to spend some time setting up the scene and the actors. (I’m going to take some liberties and simplify things for the benefit of the broader audience, and I hope my colleagues will forgive me for it.)
The scene. Privacy research is incredibly multidisciplinary, and this workshop represents one extreme of the spectrum: the math behind data privacy. The mathematical study of privacy in databases centers on one question:
If you have a bunch of data collected from individuals, and you want to let other people do something useful with the data, such as learning correlations, how do you do it without revealing individual information?
There are roughly 3 groups that investigate this question and are represented here:
- computer scientists with a background in cryptography / theoretical CS
- computer scientists with a background in databases and data mining
This classification is neither exhaustive nor strict, but it will suffice for my current purposes.
One of the problems with science and math research is that different communities studying different aspects of the the the same problem (or even studying the same problem from different perspectives) don’t meet together very often. For one, there is a good deal of friction in overcoming the language barriers (different names/ways of thinking about the same things). For another, academics are rewarded primarily for publishing in their own communities. That is why the organizers deserve a ton of credit for bridging the barriers and getting people together.
The paradigms. There is a fundamental, inescapable tension between the utility of data and the privacy of the participants. That’s the one thing that theorists and practitioners can agree on :-) Given that fact, there are two approaches to go about building a theory of privacy-protection, which I will call utility-first and privacy-first. Statisticians and database people tend to prefer the former paradigm, and cryptographers the latter; but this is not a clean division.
Utility-first hopes to be able to preserve the statistical computations that we would want to do if we didn’t have to worry about privacy, and then ask, “how can we improve the privacy of participants while still doing all these things?” Data anonymization is one natural technique that comes out of this world view: if you are only doing simple syntactic transformations to the data, the utility of the data is not affected very much.
On the other hand, privacy-first says, “let’s first figure out a rigorously provable way to assure the privacy of participants, and then go about figuring out what are the types of computations that can be carried out under this rubric.” The community has collectively decided, with good reason, that differential privacy is the right rubric to use. To explain it properly would require many Greek symbols, so I won’t.
Privacy-first and utility-first are scientific paradigms, not theories. Neither is falsifiable. We can say that one is better, but that is a judgement.
An important caveat must be noted here. The terms do not refer to the social values of putting the utility of the data before the privacy of the participants, or vice versa. Those values are external to the model and are constraints enforced by reality. Instead, we are merely talking about which paradigm gives us better analytical techniques to achieve both the utility and privacy requirements to the extent possible.
The shift. With utility-first, you have strong, well-understood guarantees on the usefulness of the data, but typically only a heuristic analysis of privacy. What this translates to is an upper bound on privacy. With privacy-first, you have strong, well-understood privacy guarantees, but you only know how to perform certain types of computations on the data. So you have a lower bound on utility.
That’s where things get interesting. Utility-first starts to look worse as time goes on, as we discover more and more inferential techniques for breaching the privacy of participants. Privacy-first starts to look better with time, as we discover that more and more types of data-mining can be carried out due to innovative algorithms. And that is exactly how things have played out over the last few years.
I was at a similarly themed workshop at Bertinoro, Italy back in 2005, with much the same audience in attendance. Back then, the two views were about equally prevalent; the first papers on differential privacy were being written or had just been written (of course, the paradigm itself was not new). Fast forward 5 years, and the proponents of one view have started to win over the other, although we quibble to no small extent over the details. Overall, though, the shift has happened in a swift and amicable way, with both sides now largely agreeing on differential privacy.
Why did privacy-first win? I can see many reasons. The privacy protections of the utility-first techniques kept getting broken (a Kuhnian “crisis”?); the de-anonymization research that I and others worked on played a big part here. Another reason might be the way the cryptographic community operates: once they decide that a paradigm is worth investigating, they tend to jump in on it all at once and pick the bones clean. That ensured that within a few years, a huge number of results of the form “how to compute X with differential privacy” were published. A third reason might very well be the fact that these interdisciplinary workshops exist, giving us an opportunity to change each other’s minds.
The fallout. While the debate in theoretical circles seems largely over, the ripple effects are going to be felt “downstream” for a long time to come. Differential privacy is only slowly penetrating other areas of research where privacy is a peripheral but not a fundamental object of study. As for law and policy, Ohm’s paper on the failure of anonymization has certainly created a bang there.
That leaves the most important contingent: practitioners. Technology companies have been quick to learn the lessons — differential privacy was invented by Microsoft researchers — and have been studying questions like sharing search logs with differential privacy assurances and building programming systems incorporating differential privacy (see PINQ developed at Microsoft Research and Airavat funded by Google.)
Other sectors, especially medical informatics, have been far slower to adapt, and it is not clear if they ever will. Multiple speakers at this workshop dealing with applications in different sectors talked about their efforts at anonymizing high-dimensional data (good luck with that). The problems are compounded by the fact that differential privacy isn’t yet at a point where it is easily usable in applications and in many cases the upshot of the theory has been to prove that the simultaneous utility and privacy requirements simply cannot be met. It will probably be the better part of a decade before differential privacy starts to make any real headway into real-world usage.
Summary. I hope I’ve shown you what scientific “paradigms” are, how they are adopted and discarded. Paradigm shifts are important turning points for scientific disciplines and often have big consequences for society as a whole. Finally, science is not a cold sequence of deductions but is done by real people with real motivations; the scientific process has a significant social and cultural component, even if the output of science is objective.