
One more re-identification demonstration, and then I’m out

What should we do about re-identification? Back when I started this blog in grad school seven years ago, I subtitled it “The end of anonymous data and what to do about it,” anticipating that I’d work on re-identification demonstrations as well as technical and policy solutions. As it turns out, I’ve looked at the former much more often than the latter. That said, my recent paper A Precautionary Approach to Big Data Privacy with Joanna Huey and Ed Felten tackles the “what to do about it” question head-on. We present a comprehensive set of recommendations for policy makers and practitioners.

One more re-identification demonstration, and then I’m out. Overall, I’ve moved on in terms of my research interests to other topics like web privacy and cryptocurrencies. That said, there’s one fairly significant re-identification demonstration I hope to do some time this year. This is something I started in grad school, obtained encouraging preliminary results on, and then put on the back burner. Stay tuned.

Machine learning and re-identification. I’ve argued that the algorithms used in re-identification turn up everywhere in computer science. I’m still interested in these algorithms from this broader perspective. My recent collaboration on de-anonymizing programmers using coding style is a good example. It uses more sophisticated machine learning than most of my earlier work on re-identification, and the potential impact is more in forensics than in privacy.

Privacy and ethical issues in big data. There’s a new set of thorny challenges in big data — privacy-violating inferences, fairness of machine learning, and ethics in general. I’m collaborating with technology ethics scholar Solon Barocas on these topics. Here’s an abstract we wrote recently, just to give you a flavor of what we’re doing:

How to do machine learning ethically

Every now and then, a story about inference goes viral. You may remember the one about Target advertising to customers who were determined to be pregnant based on their shopping patterns. The public reacts with deep discomfort at the power of inference and calls it a violation of privacy. The company in question, on the other hand, protests that there was no wrongdoing — after all, it had only collected innocuous information on customers’ purchases and hadn’t revealed that data to anyone else.

This common pattern reveals a deep disconnect between what people seem to care about when they cry privacy foul and the way the protection of privacy is currently operationalized. The idea that companies shouldn’t make inferences based on data they’ve legally and ethically collected might be disturbing and confusing to a data scientist.

And yet, we argue that doing machine learning ethically means accepting and adhering to boundaries on what’s OK to infer or predict about people, as well as how learning algorithms should be designed. We outline several categories of inference that run afoul of privacy norms. Finally, we explain why ethical considerations sometimes need to be built in at the algorithmic level, rather than being left to whoever is deploying the system. While we identify a number of technical challenges that we don’t quite know how to solve yet, we also provide some guidance that will help practitioners avoid these hazards.


March 23, 2015 at 8:20 am

Good and bad reasons for anonymizing data

Ed Felten and I recently wrote a response to a poorly reasoned defense of data anonymization. This doesn’t mean, however, that there’s never a place for anonymization. Here’s my personal view on some good and bad reasons for anonymizing data before sharing it.

Good: We’re using anonymization to keep honest people honest. We’re only providing the data to insiders (employees) or semi-insiders (research collaborators), and we want to help them resist the temptation to peep.

Probably good: We’re sharing data only with a limited set of partners. These partners have a reputation to protect; they have also signed legal agreements that specify acceptable uses, retention periods, and audits.

Possibly good: We de-identified the data at a big cost in utility — for example, by making high-dimensional data low-dimensional via “vertical partitioning” — but it still enables some useful data analysis. (There are significant unexplored research questions here, and technically sound privacy guarantees may be possible.)

Reasonable: The data needed to be released no matter what; techniques like differential privacy didn’t produce useful results on our dataset. We released de-identified data and decided to hope for the best.

Reasonable: The auxiliary data needed for de-anonymization doesn’t currently exist publicly and/or on a large scale. We’re acting on the assumption that it won’t materialize in a relevant time-frame and are willing to accept the risk that we’re wrong.

Ethically dubious: The privacy harm to individuals is outweighed by the greater good to society. Related: de-anonymization is not as bad as many other privacy risks that consumers face.

Sometimes plausible: The marginal benefit of de-anonymization (compared to simply using the auxiliary dataset for marketing or whatever purpose) is so low that even the small cost of skilled effort is a sufficient deterrent. Adversaries will prefer other means of acquiring equivalent data — through purchase, if they are lawful, or hacking, if they’re not.[*]

Bad: Since there aren’t many reports of de-anonymization except research demonstrations, it’s safe to assume it isn’t happening.

It’s surprising how often this argument is advanced considering that it’s a complete non-sequitur: malfeasors who de-anonymize are obviously not going to brag about it. The next argument is a self-interested version that takes this fact into account.

Dangerously rational: There won’t be a PR fallout from releasing anonymized data because researchers no longer have the incentive for de-anonymization demonstrations, whereas if malfeasors do it they won’t publicize it (elaborated here).

Bad: The expertise needed for de-anonymization is such a rare skill that it’s not a serious threat (addressed here).

Bad: We simulated some attacks and estimated that only 1% of records are at risk of being de-anonymized. (Completely unscientific; addressed here.)

Qualitative risk assessment is valuable; quantitative methods can be a useful heuristic to compare different choices of anonymization parameters if one has already decided to release anonymized data for other reasons, but can’t be used as a justification of the decision.

[*] This is my restatement of one of Yakowitz’s arguments in Tragedy of the Data Commons.


July 9, 2014 at 8:05 am

Personalized coupons as a vehicle for perfect price discrimination

Given the pervasive tracking and profiling of our shopping and browsing habits, one would expect that retailers would be very good at individualized price discrimination —  figuring out what you or I would be willing to pay for an item using data mining, and tailoring prices accordingly. But this doesn’t seem to be happening. Why not?

This mystery isn’t new. Mathematician Andrew Odlyzko predicted a decade ago that data-driven price discrimination would become much more common and effective (paper, interview). Back then, he was far ahead of his time. But today, behavioral advertising at least has gotten good enough that it’s often creepy. The technology works; the impediment to price discrimination lies elsewhere. [1]

It looks like consumers’ perception of unfairness of price discrimination is surprisingly strong, which is why firms balk at overt price discrimination, even though covert price discrimination is all too common. But the covert form of price discrimination is not only less efficient, it also (ironically) has significant social costs — see #3 below for an example. Is there a form of pricing that allows for perfect discrimination (i.e., complete tailoring to individuals), in a way that consumers find acceptable? That would be the holy grail.

In this post, I will argue that the humble coupon, reborn in a high-tech form, could be the solution. Here’s why.

1. Coupons tap into shopper psychology. Customers love them.

Coupons, like sales, introduce unpredictability and rewards into shopping, which provides a tiny dopamine spike that gets us hooked. JC Penney’s recent misadventure in trying to eliminate sales and coupons provides an object lesson:

“It may be a decent deal to buy that item for $5. But for someone like me, who’s always looking for a sale or a coupon — seeing that something is marked down 20 percent off, then being able to hand over the coupon to save, it just entices me. It’s a rush.”

Some startups have exploited this to the hilt, introducing “gamification” into commerce. Shopkick is a prime example. I see this as a very important trend.

2. Coupons aren’t perceived as unfair.

Given the above, shoppers have at best a dim perception of coupons as a price discrimination mechanism. Even when they do recognize it, however, coupons aren’t perceived as unfair to nearly the same degree as listing different prices for different consumers, even if the result in either case is identical. [2]

3. Traditional coupons are not personalized.

While customers may have different reasons for liking coupons, from firms’ perspective the way in which traditional coupons aid price discrimination is pretty simple: by forcing customers to waste their time. Econ texts tend to lay it out bluntly. For example, R. Preston McAfee:

Individuals generally value their time at approximately their wages, so that people with low wages, who tend to be the most price-sensitive, also have the lowest value of time. … A thrifty shopper may be able to spend an hour sorting through the coupons in the newspaper and save $20 on a $200 shopping expedition … This is a good deal for a consumer who values time at less than $20 per hour, and a bad deal for the consumer that values time in excess of $20 per hour. Thus, relatively poor consumers choose to use coupons, which permits the seller to have a price cut that is approximately targeted at the more price-sensitive group.

Clearly, for this to be effective, coupon redemption must be deliberately made time-consuming.
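To make the arithmetic in McAfee’s example concrete, here is a minimal sketch of the self-selection logic; the wage figures and time costs are invented for illustration.

```python
def redeems_coupon(hourly_wage, clipping_hours, coupon_value):
    """A shopper clips coupons only if the saving beats the time cost.

    Illustrative sketch: wages stand in for the value of time, as in
    the McAfee passage quoted above."""
    time_cost = hourly_wage * clipping_hours
    return coupon_value > time_cost

# A $20 coupon that takes an hour to hunt down is worth it at $15/hour
# but not at $40/hour: the discount self-targets price-sensitive shoppers.
print(redeems_coupon(15, 1.0, 20))  # True
print(redeems_coupon(40, 1.0, 20))  # False
```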

To the extent that there is coupon personalization, it seems to be for changing shopper behavior (e.g., getting them to try out a new product) rather than a pricing mechanism. The NYT story from last year about Target targeting pregnant women falls into this category. That said, these different forms of personalization aren’t entirely distinct, which is a point I will return to in a later article.

4. The traditional model doesn’t work well any more.

Paper coupons have a limited future. As for digital coupons, there is a natural progression toward interfaces that make it easier to acquire and redeem them. In particular, as more shoppers start to pay using their phones in stores, I anticipate coupon redemption being integrated into payment apps, thus becoming almost frictionless.

An interesting side-effect of smartphone-based coupon redemption is that it gives the shopper more privacy, avoiding the awkwardness of pulling out coupons from a purse or wallet. This will further open up coupons to a wealthier demographic, making them even less effective at discriminating between wealthier shoppers and less affluent ones.

5. The coupon is being reborn in a data-driven, personalized form.

With behavioral profiling, companies can determine how much a consumer will pay for a product, and deliver coupons selectively so that each customer’s discount reflects what they are willing to pay. The key difference is that while in the past, customers decided whether or not to look for, collect, and use a coupon, in the new model companies will determine who gets which coupons.

In the extreme, coupons will be available for all purchases, and smart shopping software on our phones or browsers will automatically search, aggregate, manage, and redeem these coupons, showing coupon-adjusted prices when browsing for products. More realistically, the process won’t be completely frictionless, since that would lose the psychological benefit. Coupons will probably also merge with “rewards,” “points,” discounts, and various other incentives.
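To make this concrete, here is a toy sketch of selective coupon delivery driven by a willingness-to-pay estimate; the estimate is assumed to come from some behavioral profiling model, which I am treating as a black box, and the numbers are invented.

```python
def personalized_coupon(list_price, predicted_wtp):
    """Issue a coupon that brings the effective price down to roughly
    what this customer is predicted to be willing to pay.

    Hypothetical sketch: `predicted_wtp` would come from a profiling
    model; a real system would add caps, noise, and business rules."""
    if predicted_wtp >= list_price:
        return 0.0  # this customer pays full price; no coupon
    return round(list_price - predicted_wtp, 2)

# Two customers, the same $50 item, different predicted willingness to pay.
print(personalized_coupon(50.00, 62.00))  # 0.0   -> no coupon
print(personalized_coupon(50.00, 34.50))  # 15.5  -> a $15.50 coupon
```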

There have been rumblings of this shift here and there for a few years now, and it seems to be happening gradually. Google’s acquisition of Incentive Targeting a few months ago seems significant, and at the very least demonstrates that tech companies, not just retailers, are eyeing this space. As digital feudalism takes root, it could accelerate the trend of individualized shopping experiences.

In summary, personalized coupons offer a vehicle for realizing the full potential of data mining for commerce by tailoring prices in a way that consumers seem to find acceptable. Neither coupons nor price discrimination should be viewed in isolation — together with rewards and various other incentive schemes, they are part of the trend of individualized, data mining-driven commerce that’s here to stay.

Footnotes

[1] Since I’m eschewing some academic terminology in this post, here are a few references and points of clarification. My interest is in first-degree price discrimination. Any price discrimination requires market power; my assumption is that this holds in practice because competition is always imperfect, so we should expect quite a bit of first-degree price discrimination. The observed level is puzzlingly low.

The impact of technology on the ability to personalize prices is complex, and behavioral profiling is only one aspect. Technology also makes competition less perfect by allowing firms to customize products to a greater degree, so that there are no exact substitutes. Finally, technology hinders first-degree price discrimination to an extent by allowing consumers to compare prices between different retailers more easily. The interaction between these effects is analyzed in this paper.

Technology also increases the incentive to price discriminate. As production becomes more and more automated, marginal costs drop relative to fixed costs. In the extreme, digital goods have essentially zero marginal cost. When marginal production costs are low, firms will try to tailor prices since any sale above marginal cost increases profits.

My use of the terms overt and covert is rooted in the theory of price fairness in psychology and behavioral economics, and relates to the presentation of the transaction. While it is somewhat related to first- vs. second/third-degree price discrimination, it is better understood as a separate axis, one that is not captured by theories of rational firms and consumers.

[2] An exception is when non-coupon customers are made aware that others are getting a better deal. This happens, for example, when there is a prominent coupon-code form field in an online shopping checkout flow. See here for a study.

Thanks to Sebastian Gold for reviewing a draft, and to Justin Brickell for interesting conversations that led me to this line of thinking.


June 25, 2013 at 7:09 am

Reidentification as Basic Science

This essay originally appeared on the Bill of Health blog as part of a conversation on the law, ethics and science of reidentification demonstrations.

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.

First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher I know, a U.S. citizen, tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.

Homer et al’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.

It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.
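As a back-of-the-envelope illustration of what “how much information” means, here is a sketch that scores attribute values by their surprisal, i.e., log2 of the inverse of their frequency. The frequencies below are invented, and summing assumes the attributes are independent, which real data rarely satisfies.

```python
import math

def surprisal_bits(frequency):
    """Bits of identifying information in an attribute value shared by
    a fraction `frequency` of the population."""
    return -math.log2(frequency)

# Invented frequencies, purely illustrative.
attributes = {
    "ZIP code":   1 / 20000,  # ~14.3 bits
    "birth date": 1 / 365,    # ~8.5 bits
    "gender":     1 / 2,      # 1 bit
}

total = sum(surprisal_bits(f) for f in attributes.values())
print(f"{total:.1f} bits")
# ~23.8 bits; about 33 bits is enough to single out one person
# among the world's population.
```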

It is not only people that are fingerprintable — so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.

Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.
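Here is a minimal sketch of what classification with a huge number of classes looks like in this setting: every candidate record is its own class, and “classification” reduces to finding the stored record most similar to the observed attribute vector. The similarity measure and the toy data are placeholders, not any particular published algorithm.

```python
def cosine(a, b):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def reidentify(observation, candidates):
    """Treat every candidate record as a class; return the best match.

    Placeholder sketch: real reidentification algorithms also weight
    rare attributes more heavily and report a confidence score."""
    return max(candidates, key=lambda name: cosine(observation, candidates[name]))

candidates = {
    "user_a": {"movie:Heat": 5, "movie:Alien": 4},
    "user_b": {"movie:Heat": 2, "movie:Clue": 5},
}
print(reidentify({"movie:Heat": 5, "movie:Alien": 5}, candidates))  # user_a
```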

Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.
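Schematically, the leakage looks like this; the data structures are made up, and the mapping is assumed to come from a prior deanonymization step that links contest IDs to a public crawl.

```python
def predict_withheld_edge(u, v, id_map, public_edges):
    """'Predict' whether the withheld edge (u, v) exists.

    Hypothetical sketch of contest leakage: if both endpoints were
    linked to the public crawl, no modeling is needed at all."""
    pu, pv = id_map.get(u), id_map.get(v)
    if pu is not None and pv is not None:
        return 1.0 if (pu, pv) in public_edges or (pv, pu) in public_edges else 0.0
    return 0.5  # fall back to an actual prediction for unlinked nodes

id_map = {17: "alice", 42: "bob"}    # from a deanonymization step
public_edges = {("alice", "bob")}    # from a crawl of the live site
print(predict_withheld_edge(17, 42, id_map, public_edges))  # 1.0
```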

If reidentification can cause leakage and make things messy, it can also clean things up. In a general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.

In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways). On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
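The core idea in both of those projects can be sketched in a few lines: compare the sets of items attached to each record (album titles in one case, coauthor names in the other) and link the pair with the highest overlap. This is a generic illustration, not the actual code from either project, and the threshold and example data are placeholders.

```python
def jaccard(a, b):
    """Set overlap: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def link(record, candidates, threshold=0.3):
    """Link `record` (a set of associated items) to the best-matching
    candidate, or return None if nothing clears the threshold.

    Placeholder sketch; a real pipeline would also normalize strings
    and break ties using additional features."""
    best = max(candidates, key=lambda name: jaccard(record, candidates[name]))
    return best if jaccard(record, candidates[best]) >= threshold else None

wikipedia = {
    "Nirvana (band)":    {"Bleach", "Nevermind", "In Utero"},
    "Nirvana (UK band)": {"The Story of Simon Simopath"},
}
print(link({"Nevermind", "In Utero", "MTV Unplugged in New York"}, wikipedia))
# -> Nirvana (band)
```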

There is more, but I’ll stop here. The point is that these algorithms are everywhere.

If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.

Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.

My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?

Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the subjects in the datasets I study.

I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in the back of our minds as we continue the conversation on the ethics of reidentification.

Thanks to Vitaly Shmatikov for reviewing a draft.


May 27, 2013 at 6:16 am

Price Discrimination and the Illusion of Fairness

In my previous article I pointed out that online price discrimination is suspiciously absent in directly observable form, even though covert price discrimination is everywhere. Now let’s talk about why that might be.

By “covert” I don’t mean that the firm is trying to keep price discrimination a secret. Rather, I mean that the differential treatment isn’t made explicit — for example, it isn’t based directly on a customer attribute — and thereby avoids triggering the perception of unfairness or discrimination. A common example is selective distribution of coupons instead of listing different prices. Such discounting may be publicized, but it is still covert.

The perception of fairness

The perception of fairness or unfairness, then, is at the heart of what’s going on. Going back to the WSJ piece, I found it interesting to see the reaction of the customer to whom Staples quoted $1.50 more for a stapler based on her ZIP code: “How can they get away with that?” she asks. To which my initial reaction was, “Get away with what, exactly? Supply and demand? Econ 101?”

Even though some of us might not feel the same outrage, I think all of us share at least a vague sense of unease about overt price discrimination. So I decided to dig deeper into the literature in psychology, marketing, and behavioral economics on the topic of price fairness and understand where this perception comes from. What I found surprised me.

First, the fairness heuristic is quite elaborate and complex. In a vast literature spanning several decades, early work such as the “principle of dual entitlement” by Kahneman and coauthors established some basics. Quoting Anderson and Simester: “This theory argues that customers have perceived fairness levels for both firm profits and retail prices. Although firms are entitled to earn a fair profit, customers are also entitled to a fair price. Deviations from a fair price can be justified only by the firm’s need to maintain a fair profit. According to this argument, it is fair for retailers to raise the price of snow shovels if the wholesale price increases, but it is not fair to do so if a snowstorm leads to excess demand.”

Much later work has added to and refined that model. A particularly impressive and highly cited 2004 paper reviews the literature and proposes an elaborate framework with four different classes of inputs to explain how people decide if pricing is fair or unfair in various situations. Some of the findings are quite surprising. For example: in the case of differential pricing to the buyer’s disadvantage, “trust in the seller has a U-shaped effect on price fairness perceptions.”

The illusion of fairness

Sounds like we have a well-honed and sophisticated decision procedure, then? Quite the opposite, actually. The fairness heuristic seems to be rather fragile, even if complex.

Let’s start with an example. Andrew Odlyzko, in his brilliant essay on price discrimination — all the more for the fact that it was published back in 2003 [1] — has this to say about Coca Cola’s ill-fated plans for price-adjusting vending machines: “In retrospect, Coca Cola’s main problem was that news coverage always referred to its work as leading to vending machines that would raise prices in warm weather. Had it managed to control publicity and present its work as leading to machines that would lower prices in cold weather, it might have avoided the entire controversy.”

We know how to explain the public’s reaction to the Coca Cola announcement using behavioral economics — given the way the plan was presented (or framed), customers took the lower price as the “reference price,” so the price increase seemed unfair, whereas Odlyzko’s suggested framing would have anchored the higher price as the reference price. Of course, just because we can explain how the fairness heuristic works doesn’t make it logical or consistent, let alone properly grounded in social justice.

More generally, every aspect of our mental price fairness assessment heuristic seems similarly vulnerable to hijacking by tweaking the presentation of the transaction without changing the essence of price discrimination. Companies have of course gotten wise to this; there’s even academic literature on it. One of the techniques proposed in this paper is “reference group signaling” — getting a customer to change the set of other customers to whom they mentally compare themselves. [2]

The perception of fairness, then, can be more properly called the illusion of fairness.

The fragility of the fairness heuristic becomes less surprising considering that we apparently share it with other primates. This hilarious clip from a TED talk shows a capuchin monkey reacting poorly, to put it mildly, to differential treatment in a monkey-commerce setting (although the jury may still be out on the significance of this experiment). If our reaction to pricing schemes is partly or largely due to brain circuitry that evolved millions of years ago, we shouldn’t expect it to fare well when faced with the complexities of modern business.

Lose-lose

Given that the prime impediment to pervasive online price discrimination is a moral principle that is fickle and easily circumvented, one can expect companies to do exactly that: circumvent it, reaping most of the benefits of price discrimination without the negative PR. Indeed, it is my belief that more covert price discrimination is going on than is generally recognized, and that it is accelerating due to some technological developments.

This is a problem because price discrimination does raise ethical concerns, and these concerns are every bit as significant when it is covert. [3] However, since it is much less transparent, there’s less of an opportunity for public debate.

There are two directions in which I want to take this series of articles from this point: first a look at how new technology is enabling powerful forms of tailoring and covert price discrimination, and second, a discussion of what can be done to make price discrimination more transparent and how to have an informed policy discussion about its benefits and dangers.

[1] I had the pleasure of sitting next to Professor Odlyzko at a conference dinner once, and I  expressed my admiration of the prescience of his article. He replied that he’d worked it all out in his head circa 1996 but took a few years to put it down on paper. I could only stare at him wordlessly.

[2] I’m struck by the similarities between price fairness perceptions and privacy perceptions. The aforementioned 2004 price fairness framework can be seen as serving a roughly analogous function to contextual integrity, which is (in part) a theory of consumer privacy expectations. Both these theories are the result of “reverse engineering,” if you will, of the complex mental models in their respective domains using empirical behavioral evidence. Continuing the analogy, privacy expectations are also fragile, highly susceptible to framing, and liable to be exploited by companies. Acquisti and Grossklags, among others, have done some excellent empirical work on this.

[3] In fact, crude ways of making customers reveal their price sensitivity lead to a much higher social cost than overt price discrimination. I will take this up in more detail in a future post.

Thanks to Alejandro Molnar, Joseph Bonneau, Solon Barocas, and many others for insightful conversations on this topic.


January 22, 2013 at 10:24 am

New Developments in Deanonymization

This post is a roundup of developments in deanonymization in the last few months. Let’s start with two stories relating to how a malicious website can silently discover the identity of a visitor, which is an insidious type of privacy breach that I’ve written about quite a bit (1, 2, 3, 4, 5, 6).

Firefox bug exposed your identity. The first is a vulnerability resulting from a Firefox bug in the implementation of functions like exec and test. The bug allows a website to learn the URL of an embedded iframe from some other domain. How can this lead to uncovering the visitor’s identity? Because twitter.com/lists redirects to twitter.com/<username>/lists. This allows a malicious website to open a hidden iframe pointing to twitter.com/lists, query the URL after redirection, and learn the visitor’s Twitter handle (if they are logged in). [1,2]

This is very similar to a previous bug in Firefox that led to the same type of vulnerability. The URL redirect that was exploited there was google.com/profiles/me  → user-specific URL. It would be interesting to find and document all such generic-URL → user-specific-URL redirects in major websites. I have a feeling this won’t be the last time such redirection will be exploited.
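Finding such redirects is easy to automate. Here is a rough sketch that, given cookies from a logged-in browser session, flags generic URLs whose final destination differs after redirection; the URL list, the cookie handling, and the heuristic are all placeholders, and this only probes the redirect pattern, not the cross-origin leak itself.

```python
import requests

# Example from the post; extend with other candidate "generic" URLs.
GENERIC_URLS = [
    "https://twitter.com/lists",
]

def final_url(url, cookies):
    """Follow redirects and return where the URL ends up."""
    resp = requests.get(url, cookies=cookies, allow_redirects=True, timeout=10)
    return resp.url

cookies = {}  # fill in from your own logged-in browser session
for url in GENERIC_URLS:
    dest = final_url(url, cookies)
    if dest.rstrip("/") != url.rstrip("/"):
        # Crude heuristic: any change suggests a user-specific redirect
        # worth documenting (it may also just be a login page).
        print(f"{url} -> {dest}")
```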

Visitor deanonymization in the wild. The second story is an example of visitor deanonymization happening in the wild. It appears that the technique utilizes a tracking cookie from a third-party domain to which the visitor had previously given their email address and other information; in other words, it’s #3 in my five-fold categorization of ways in which identity can be attached to browsing logs.

I don’t consider this instance to be particularly significant — I’m sure there are other implementations in the wild — and it’s not technically novel, but this is the first time, as far as I know, that it’s gotten significant attention from the public, even if only in tech circles. I see this as the first step in a feedback loop: changing expectations about online anonymity embolden more sites to deanonymize visitors, which further lowers the expectation of privacy.

Deanonymization of mobility traces. Let’s move on to the more traditional scenario of deanonymization of a dataset by combining it with an auxiliary, public dataset which has users’ identities. Srivatsa and Hicks have a new paper with demonstrations of deanonymization of mobility traces, i.e., logs of users’ locations over time. They use public social networks as auxiliary information, based on the insight that pairs of people who are friends are more likely to meet with each other physically. The deanonymization of Bluetooth contact traces of attendees of a conference based on their DBLP co-authorship graph is cute.

This paper adds to the growing body of evidence that anonymization of location traces can be reversed, even if the data is obfuscated by introducing errors (noise).
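The matching intuition (friends tend to co-locate; strangers rarely do) can be sketched as a simple scoring rule over candidate assignments. This is a toy version, not the algorithm from the paper, and the traces and social network below are made up.

```python
from itertools import combinations

def colocations(trace_a, trace_b):
    """Count (place, time) events shared by two anonymized traces."""
    return len(set(trace_a) & set(trace_b))

def score_mapping(mapping, traces, friendships):
    """Score a candidate assignment of trace IDs to identities: reward
    co-location between traces whose assigned identities are friends
    in the auxiliary social network.

    Toy sketch only; the published attacks search the space of
    mappings far more cleverly than scoring one candidate at a time."""
    total = 0
    for t1, t2 in combinations(mapping, 2):
        if frozenset((mapping[t1], mapping[t2])) in friendships:
            total += colocations(traces[t1], traces[t2])
    return total

traces = {
    "anon1": [("cafe", 9), ("office", 10)],
    "anon2": [("cafe", 9), ("gym", 18)],
}
friendships = {frozenset(("alice", "bob"))}
print(score_mapping({"anon1": "alice", "anon2": "bob"}, traces, friendships))    # 1
print(score_mapping({"anon1": "alice", "anon2": "carol"}, traces, friendships))  # 0
```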

So many datasets, so little time. Speaking of mobility traces, Jason Baldridge points me to a dataset containing mobility traces (among other things) of 5 million “anonymous” users in the Ivory Coast recently released by telecom operator Orange. A 250-word research proposal is required to get access to the data, which is much better from a privacy perspective than a 1-click download. It introduces some accountability without making it too onerous to get the data.

In general, the incentive for computer science researchers to perform practical demonstrations of deanonymization has diminished drastically. Our goal has always been to showcase new techniques and improve our understanding of what’s possible, and not to name and shame. Even if the Orange dataset were more easily downloadable, I would think that the incentive for deanonymization researchers would be low, now that the Srivatsa and Hicks paper exists and we know for sure that mobility traces can be deanonymized, even though the experiments in the paper are on a far smaller scale.

Head in the sand: rational?! I gave a talk at a privacy workshop recently taking a look back at how companies have reacted to deanonymization research. My main point was that there’s a split between the take-your-data-and-go-home approach (not releasing data because of privacy concerns) and the head-in-the-sand approach (pretending the problem doesn’t exist). Unfortunately but perhaps unsurprisingly, there has been very little willingness to take a middle ground, engaging with data privacy researchers and trying to adopt technically sophisticated solutions.

Interestingly, head-in-the-sand might be rational from companies’ point of view. On the one hand, researchers don’t have the incentive for deanonymization anymore. On the other hand, if malicious entities do it, naturally they won’t talk about it in public, so there will be no PR fallout. Regulators have not been very aggressive in investigating anonymized data releases in the absence of a public outcry, so that may be a negligible risk.

Some have questioned whether deanonymization in the wild is actually happening. I think it’s a bit silly to assume that it isn’t, given the economic incentives. Of course, I can’t prove this and probably never can. No company doing it will publicly talk about it, and the privacy harms are so indirect that tying them to a specific data release is next to impossible. I can only offer anecdotes to explain my position: I have been approached multiple times by organizations who wanted me to deanonymize a database they’d acquired, and I’ve had friends in different industries mention casually that what they do on a daily basis to combine different databases together is essentially deanonymization.

[1] For a discussion of why a social network profile is essentially equivalent to an identity, see here and the epilog here.
[2] Mozilla pulled Firefox 16 as a result and quickly fixed the bug.


December 17, 2012 at 8:59 am

Data-mining Contests and the Deanonymization Dilemma: a Two-stage Process Could Be the Way Out

Anonymization, once the silver bullet of privacy protection in consumer databases, has been shown to be fundamentally inadequate by the work of many computer scientists including myself. One of the best defenses is to control the distribution of the data: strong acceptable-use agreements including prohibition of deanonymization and limits on data retention.

These measures work well when outsourcing data to another company or a small set of entities. But what about scientific research and data mining contests involving personal data? Prizes are big and only getting bigger, and by their very nature involve wide data dissemination. Are legal restrictions meaningful or enforceable in this context?

I believe that having participants sign and fax a data-use agreement is much better from the privacy perspective than being able to download the data with a couple of clicks. However, I am sympathetic to the argument I hear from contest organizers that every extra step will result in a big drop-off in the participation rate. Basic human psychology suggests that instant gratification is crucial.

That is a dilemma. But the more I think about it, the more I’m starting to feel that a two-step process could be a way to get the best of both worlds. Here’s how it would work.

For the first stage, the current minimally intrusive process is retained, but the contestants don’t get to download the full data. Instead, there are two possibilities.

  • Release data on only a subset of users, minimizing the quantitative risk. [1]
  • Release a synthetic dataset created to mimic the characteristics of the real data. [2]

For the second stage, there are various possibilities, not mutually exclusive:

  • Require contestants to sign a data-use agreement.
  • Restrict the contest to a shortlist of best performers from the first stage.
  • Switch to an “online computation model” where participants upload code to the server (or make database queries over the network) and obtain results, rather than download data.

Overstock.com recently announced a contest that conformed to this structure — a synthetic data release followed by a semi-final and a final round in which selected contestants upload code to be evaluated against data. The reason for this structure appears to be partly privacy and partly the fact that they are trying to improve the performance of their live system, and performance needs to be judged in terms of impact on real users.

In the long run, I really hope that an online model will take root. The privacy benefits are significant: high-tech machinery like differential privacy works better in this setting. Even if such techniques are not employed, and even though contestants could in theory extract all the data by issuing malicious queries, the fact that queries are logged and might be audited should serve as a strong deterrent against such mischief.
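As a minimal sketch of what the online model could look like, here is a toy server that answers counting queries with Laplace noise and logs every query; the epsilon value, the schema, and the logging format are placeholder choices, and a real deployment would need a privacy-budget tracker and a vetted differential-privacy implementation.

```python
import random

class QueryServer:
    """Toy online-computation endpoint: contestants submit counting
    queries, receive Laplace-noised answers, and every query is logged
    for auditing."""

    def __init__(self, records, epsilon=0.5):
        self.records = records
        self.epsilon = epsilon
        self.log = []

    def count(self, contestant_id, predicate):
        true_count = sum(1 for r in self.records if predicate(r))
        # Difference of two exponentials = Laplace noise with scale
        # 1/epsilon, which suffices for a sensitivity-1 counting query.
        noise = random.expovariate(self.epsilon) - random.expovariate(self.epsilon)
        # A real system would log the query text and a timestamp too.
        self.log.append(contestant_id)
        return true_count + noise

records = [{"age": 34, "claims": 2}, {"age": 61, "claims": 7}, {"age": 45, "claims": 0}]
server = QueryServer(records)
print(server.count("team42", lambda r: r["age"] > 40))  # noisy answer centered on 2
print(server.log)  # the audit trail that deters malicious querying
```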

The advantages of the online model go beyond privacy. For example, I served on the Heritage Health Prize advisory board, and we discussed mandating a limit on the amount of computation that contestants were allowed. The motivation was to rule out algorithms that needed so much hardware firepower that they couldn’t be deployed in practice, but the stipulation had to be rejected as unenforceable. In an online model, enforcement would not be a problem. Another potential benefit is the possibility of collaboration between contestants at the code level, almost like an open-source project.

[1] Obtaining informed consent from the subset whose data is made publicly available would essentially eliminate the privacy risk, but the caveat is the possibility of selection bias.

[2] Creating a synthetic dataset from a real one without leaking individual data points and at the same time retaining the essential characteristics of the data is a serious technical challenge, and whether or not it is feasible will depend on the nature of the specific dataset.
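To illustrate footnote [2], here is the crudest possible approach: resample each column independently from its empirical distribution. It preserves per-column statistics but destroys correlations and offers no formal privacy guarantee; anything actually deployed would need to be far more careful. The example records are invented.

```python
import random

def synthesize(rows, n):
    """Crude synthetic data: resample each column independently from
    its empirical marginal. Preserves single-column statistics,
    destroys joint structure, and gives no formal privacy guarantee;
    for illustration only."""
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    return [{key: random.choice(vals) for key, vals in columns.items()}
            for _ in range(n)]

real = [
    {"age": 34, "zip": "08540", "visits": 2},
    {"age": 61, "zip": "08540", "visits": 7},
    {"age": 45, "zip": "10001", "visits": 0},
]
print(synthesize(real, 2))
```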


June 14, 2011 at 6:54 pm



About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
