Data Privacy: The Story of a Paradigm Shift
Let’s take a break from the Ubercookies series. I’m at the IPAM data privacy workshop in LA, and I want to tell you about the kind of unusual scientific endeavor that it represents. I’ve recently started to write about the process of doing science, what’s good and what’s bad about it, and I expect to have more to say on this topic in this blog.
While “paradigm shift” has become a buzzword, the original sense in which Kuhn used it refers to a specific scientific process. I’ve had the rare experience of witnessing such a paradigm shift unfold, and I may even have played a small part. I am going to tell that story. I hope it will give you a “behind-the-scenes” look into how science works.
I will sidestep the question of whether data privacy is a science. I think it is a science to the extent that computer science is a science. At any rate, I think this narrative provides a nice illustration of Kuhn’s ideas.
First I need to spend some time setting up the scene and the actors. (I’m going to take some liberties and simplify things for the benefit of the broader audience, and I hope my colleagues will forgive me for it.)
The scene. Privacy research is incredibly multidisciplinary, and this workshop represents one extreme of the spectrum: the math behind data privacy. The mathematical study of privacy in databases centers on one question:
If you have a bunch of data collected from individuals, and you want to let other people do something useful with the data, such as learning correlations, how do you do it without revealing individual information?
There are roughly 3 groups that investigate this question and are represented here:
- computer scientists with a background in cryptography / theoretical CS
- computer scientists with a background in databases and data mining
- statisticians
This classification is neither exhaustive nor strict, but it will suffice for my current purposes.
One of the problems with science and math research is that different communities studying different aspects of the same problem (or even studying the same problem from different perspectives) don’t meet together very often. For one, there is a good deal of friction in overcoming the language barriers (different names/ways of thinking about the same things). For another, academics are rewarded primarily for publishing in their own communities. That is why the organizers deserve a ton of credit for bridging the barriers and getting people together.
The paradigms. There is a fundamental, inescapable tension between the utility of data and the privacy of the participants. That’s the one thing that theorists and practitioners can agree on :-) Given that fact, there are two approaches to go about building a theory of privacy-protection, which I will call utility-first and privacy-first. Statisticians and database people tend to prefer the former paradigm, and cryptographers the latter; but this is not a clean division.
Utility-first hopes to be able to preserve the statistical computations that we would want to do if we didn’t have to worry about privacy, and then ask, “how can we improve the privacy of participants while still doing all these things?” Data anonymization is one natural technique that comes out of this world view: if you are only doing simple syntactic transformations to the data, the utility of the data is not affected very much.
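To make “simple syntactic transformations” concrete, here is a minimal sketch in Python. The record layout and field names (name, age, zip, diagnosis) are hypothetical and chosen only for illustration; this is not any particular published anonymization scheme.

```python
def anonymize(record):
    """Apply simple syntactic transformations to one record (a dict).

    All field names here are hypothetical, for illustration only.
    """
    decade = (record["age"] // 10) * 10
    return {
        # The direct identifier ("name") is dropped entirely.
        "age_range": f"{decade}-{decade + 9}",     # generalize age to a decade
        "zip_prefix": record["zip"][:3] + "**",    # keep only a coarse location
        "diagnosis": record["diagnosis"],          # the useful attribute is untouched
    }


patients = [
    {"name": "Alice", "age": 34, "zip": "90095", "diagnosis": "flu"},
    {"name": "Bob", "age": 38, "zip": "90024", "diagnosis": "asthma"},
]
released = [anonymize(p) for p in patients]
```

Aggregate statistics such as the number of patients per diagnosis come out exactly the same on the released data, which is the appeal of this world view. How much privacy the transformation buys is a separate question that the code itself does not answer.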
On the other hand, privacy-first says, “let’s first figure out a rigorously provable way to assure the privacy of participants, and then go about figuring out what are the types of computations that can be carried out under this rubric.” The community has collectively decided, with good reason, that differential privacy is the right rubric to use. To explain it properly would require many Greek symbols, so I won’t.
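For a concrete feel without the Greek symbols, here is a minimal sketch of one standard building block, the Laplace mechanism, applied to a counting query. The function name private_count, the sample data and the value of epsilon are illustrative assumptions, and this is a toy rather than a production implementation (sampling noise with floating-point arithmetic has known subtleties).

```python
import random


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1, so
    Laplace noise with scale 1/epsilon is the standard choice here.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two independent exponential samples with rate
    # epsilon is a Laplace sample with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


ages = [23, 35, 41, 29, 52, 61, 38]  # hypothetical data
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
```

Each call returns a different noisy answer centered on the true count of 3. The analyst gives up some accuracy, and in exchange the output distribution changes only slightly whether or not any one person is in the data.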
Privacy-first and utility-first are scientific paradigms, not theories. Neither is falsifiable. We can say that one is better, but that is a judgement.
An important caveat must be noted here. The terms do not refer to the social values of putting the utility of the data before the privacy of the participants, or vice versa. Those values are external to the model and are constraints enforced by reality. Instead, we are merely talking about which paradigm gives us better analytical techniques to achieve both the utility and privacy requirements to the extent possible.
The shift. With utility-first, you have strong, well-understood guarantees on the usefulness of the data, but typically only a heuristic analysis of privacy. What this translates to is an upper bound on privacy: the data is at most as private as the heuristic analysis suggests. With privacy-first, you have strong, well-understood privacy guarantees, but you only know how to perform certain types of computations on the data. So you have a lower bound on utility: the data is at least as useful as the computations known so far allow.
That’s where things get interesting. Utility-first starts to look worse as time goes on, as we discover more and more inferential techniques for breaching the privacy of participants. Privacy-first starts to look better with time, as we discover that more and more types of data-mining can be carried out due to innovative algorithms. And that is exactly how things have played out over the last few years.
I was at a similarly themed workshop at Bertinoro, Italy back in 2005, with much the same audience in attendance. Back then, the two views were about equally prevalent; the first papers on differential privacy were being written or had just been written (of course, the paradigm itself was not new). Fast forward 5 years, and the proponents of one view have started to win over the other, although we quibble to no small extent over the details. Overall, though, the shift has happened in a swift and amicable way, with both sides now largely agreeing on differential privacy.
Why did privacy-first win? I can see many reasons. The privacy protections of the utility-first techniques kept getting broken (a Kuhnian “crisis”?); the de-anonymization research that I and others worked on played a big part here. Another reason might be the way the cryptographic community operates: once they decide that a paradigm is worth investigating, they tend to jump in on it all at once and pick the bones clean. That ensured that within a few years, a huge number of results of the form “how to compute X with differential privacy” were published. A third reason might very well be the fact that these interdisciplinary workshops exist, giving us an opportunity to change each other’s minds.
The fallout. While the debate in theoretical circles seems largely over, the ripple effects are going to be felt “downstream” for a long time to come. Differential privacy is only slowly penetrating other areas of research where privacy is a peripheral but not a fundamental object of study. As for law and policy, Ohm’s paper on the failure of anonymization has certainly created a bang there.
That leaves the most important contingent: practitioners. Technology companies have been quick to learn the lessons (differential privacy was invented by Microsoft researchers) and have been studying questions like sharing search logs with differential privacy assurances and building programming systems incorporating differential privacy (see PINQ, developed at Microsoft Research, and Airavat, funded by Google).
Other sectors, especially medical informatics, have been far slower to adapt, and it is not clear if they ever will. Multiple speakers at this workshop dealing with applications in different sectors talked about their efforts at anonymizing high-dimensional data (good luck with that). The problems are compounded by the fact that differential privacy isn’t yet at a point where it is easily usable in applications, and in many cases the upshot of the theory has been to prove that the simultaneous utility and privacy requirements simply cannot be met. It will probably be the better part of a decade before differential privacy starts to make any real headway into real-world usage.
Summary. I hope I’ve shown you what scientific “paradigms” are and how they are adopted and discarded. Paradigm shifts are important turning points for scientific disciplines and often have big consequences for society as a whole. Finally, science is not a cold sequence of deductions but is done by real people with real motivations; the scientific process has a significant social and cultural component, even if the output of science is objective.