Good and bad reasons for anonymizing data

Ed Felten and I recently wrote a response to a poorly reasoned defense of data anonymization. This doesn’t mean, however, that there’s never a place for anonymization. Here’s my personal view on some good and bad reasons for anonymizing data before sharing it.

Good: We’re using anonymization to keep honest people honest. We’re only providing the data to insiders (employees) or semi-insiders (research collaborators), and we want to help them resist the temptation to peep.

Probably good: We’re sharing data only with a limited set of partners. These partners have a reputation to protect; they have also signed legal agreements that specify acceptable uses, retention periods, and audits.

Possibly good: We de-identified the data at a big cost in utility — for example, by making high-dimensional data low-dimensional via “vertical partitioning” — but it still enables some useful data analysis. (There are significant unexplored research questions here, and technically sound privacy guarantees may be possible.)

Reasonable: The data needed to be released no matter what; techniques like differential privacy didn’t produce useful results on our dataset. We released de-identified data and decided to hope for the best.

Reasonable: The auxiliary data needed for de-anonymization doesn’t currently exist publicly and/or on a large scale. We’re acting on the assumption that it won’t materialize in a relevant time-frame and are willing to accept the risk that we’re wrong.

Ethically dubious: The privacy harm to individuals is outweighed by the greater good to society. Related: de-anonymization is not as bad as many other privacy risks that consumers face.

Sometimes plausible: The marginal benefit of de-anonymization (compared to simply using the auxiliary dataset for marketing or whatever purpose) is so low that even the small cost of skilled effort is a sufficient deterrent. Adversaries will prefer other means of acquiring equivalent data — through purchase, if they are lawful, or hacking, if they’re not.[*]

Bad: Since there aren’t many reports of de-anonymization except research demonstrations, it’s safe to assume it isn’t happening.

It’s surprising how often this argument is advanced, considering that it’s a complete non-sequitur: malfeasors who de-anonymize are obviously not going to brag about it. The next argument is a self-interested version that takes this fact into account.

Dangerously rational: There won’t be a PR fallout from releasing anonymized data because researchers no longer have the incentive for de-anonymization demonstrations, whereas if malfeasors do it they won’t publicize it (elaborated here).

Bad: The expertise needed for de-anonymization is such a rare skill that it’s not a serious threat (addressed here).

Bad: We simulated some attacks and estimated that only 1% of records are at risk of being de-anonymized. (Completely unscientific; addressed here.)

Qualitative risk assessment is valuable; quantitative methods can be a useful heuristic to compare different choices of anonymization parameters if one has already decided to release anonymized data for other reasons, but can’t be used as a justification of the decision.
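An added example (not from the original post) of the kind of comparison this paragraph describes: it scores three generalization settings on constructed records by smallest equivalence class, fraction of unique records, and number of distinct classes as a crude stand-in for utility. The records, field names and settings are assumptions; the numbers only rank the settings against each other and are not an estimate of how many records could be de-anonymized.

```python
from collections import Counter

# Constructed sample records (zip code, birth year, sex); not real data.
RECORDS = [
    ("08540", 1971, "F"), ("08540", 1974, "F"), ("08542", 1978, "F"),
    ("08544", 1972, "M"), ("08540", 1979, "M"), ("08901", 1985, "F"),
    ("08904", 1983, "F"), ("08902", 1988, "M"),
]

# Hypothetical anonymization parameter choices to compare.
SETTINGS = {
    "raw": {"zip_digits": 5, "year_bucket": 1},
    "coarse": {"zip_digits": 3, "year_bucket": 10},
    "coarse_no_sex": {"zip_digits": 3, "year_bucket": 10, "keep_sex": False},
}

def generalize(record, zip_digits, year_bucket, keep_sex=True):
    """Coarsen one record: truncate the zip code, bucket the birth year,
    and optionally drop the sex column."""
    zip_code, year, sex = record
    key = (zip_code[:zip_digits], year - year % year_bucket)
    return key + (sex,) if keep_sex else key

def score(records, **params):
    """Summarize the equivalence classes one parameter choice produces."""
    sizes = Counter(generalize(r, **params) for r in records)
    unique = sum(1 for n in sizes.values() if n == 1)
    return {
        "k": min(sizes.values()),                   # smallest class size
        "unique_fraction": unique / len(records),   # records alone in their class
        "classes": len(sizes),                      # distinct values left to analyze
    }

def compare(records=RECORDS, settings=SETTINGS):
    """Score every setting on the same records so they can be ranked."""
    return {name: score(records, **p) for name, p in settings.items()}

if __name__ == "__main__":
    for name, result in compare().items():
        print(f"{name:15s} {result}")
```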

[*] This is my restatement of one of Yakowitz’s arguments in Tragedy of the Data Commons.

