Data-mining Contests and the Deanonymization Dilemma: a Two-stage Process Could Be the Way Out
Anonymization, once the silver bullet of privacy protection in consumer databases, has been shown to be fundamentally inadequate by the work of many computer scientists including myself. One of the best defenses is to control the distribution of the data: strong acceptable-use agreements including prohibition of deanonymization and limits on data retention.
These measures work well when outsourcing data to another company or a small set of entities. But what about scientific research and data mining contests involving personal data? Prizes are big and only getting bigger, and by their very nature involve wide data dissemination. Are legal restrictions meaningful or enforceable in this context?
I believe that having participants sign and fax a data-use agreement is much better from the privacy perspective than being able to download the data with a couple of clicks. However, I am sympathetic to the argument that I hear from contest organizers that every extra step will result a big drop-off in the participation rate. Basic human psychology suggests that instant gratification is crucial.
That is a dilemma. But the more I think about it, the more I’m starting to feel that a two-step process could be a way to get the best of both worlds. Here’s how it would work.
For the first stage, the current minimally intrusive process is retained, but the contestants don’t get to download the full data. Instead, there are two possibilities.
- Release data on only a subset of users, minimizing the quantitative risk. 
- Release a synthetic dataset created to mimic the characteristics of the real data. 
For the second stage, there are various possibilities, not mutually exclusive:
- Require contestants to sign a data-use agreement.
- Restrict the contest to a shortlist of best performers from the first stage.
- Switch to an “online computation model” where participants upload code to the server (or make database queries over the network) and obtain results, rather than download data.
Overstock.com recently announced a contest that conformed to this structure—a synthetic data release followed by a semi-final and a final round in which selected contestants upload code to be evaluated against data. The reason for this structure appears to be partly privacy and partly the fact that are trying to improve the performance of their live system, and performance needs to be judged in terms of impact on real users.
In the long run, I really hope that an online model will take root. The privacy benefits are significant: high-tech machinery like differential privacy works better in this setting. But even if such techniques are not employed, although there is the theoretical possibility of contestants extracting all the data by issuing malicious queries, the fact that queries are logged and might be audited should serve as a strong deterrent against such mischief.
The advantages of the online model go beyond privacy. For example, I served on the Heritage Health Prize advisory board, and we discussed mandating a limit on the amount of computation that contestants were allowed. The motivation was to rule out algorithms that needed so much hardware firepower that they couldn’t be deployed in practice, but the stipulation had to be rejected as unenforceable. In an online model, enforcement would not be a problem. Another potential benefit is the possibility of collaboration between contestants at the code level, almost like an open-source project.
 Obtaining informed consent from the subset whose data is made publicly available would essentially eliminate the privacy risk, but the caveat is the possibility of selection bias.
 Creating a synthetic dataset from a real one without leaking individual data points and at the same time retaining the essential characteristics of the data is a serious technical challenge, and whether or not it is feasible will depend on the nature of the specific dataset.