Posts tagged ‘privacy’
Previous articles in this series looked at fingerprinting of blank paper and digital cameras. This article is about fingerprinting of RFID, a domain where research has directly investigated the privacy threat, namely tracking people in public.
The principle behind RFID fingerprinting is the same as with digital cameras:
The basics. First let’s get the obvious question out of the way: why are we talking about devious methods of identifying RFID chips, when the primary raison d’être of RFID is to enable unique identification? Why not just use them in the normal way?
The answer is that fingerprinting, which exploits the physical properties of RFID chips rather than their logical behavior, allows identifying them in unintended ways and in unintended contexts, and this is powerful. RFID applications, for example in e-passports or smart cards, can often be cloned at the logical level, either because there is no authentication or because authentication is broken. Fingerprinting can make the system (more) secure, since fingerprints arise from microscopic randomness and there is no known way to create a tag with a given fingerprint.
If sensor patterns in digital cameras are a relatively clean example of fingerprinting, RF (and anything to do with the electromagnetic spectrum in general) is the opposite. First, the data is an arbitrary waveform instead of a fixed-size sequence of bits. This means that a simple point-by-point comparison won’t work for fingerprint verification; the task is conceptually more similar to algorithmically comparing two faces. Second, the probe signal itself is variable. RFID chips are passive: they respond to the signal produced by the reader (and draw power from it). This means that the fingerprinting system is in full control of what kind of signal to interrogate the chip with. It’s a bit like being given a blank canvas to paint on.
Techniques. A group at ETH Zurich has done some impressive work in this area. In their 2009 paper, they report being able to compare an RFID card with a stored fingerprint and determine if they are the same, with an error rate of 2.5%–4.5% depending on settings. They use two types of signals to probe the chip with — “burst” and “sweep” — and extract features from the response based on the spectrum.
Other papers have demonstrated different ways to generate signals/extract features. A University of Arkansas team exploited the minimum power required to get a response from the tag at various frequencies. The authors achieved a 94% true-positive rate using 50 identical tags, with only a 0.1% false-positive rate. (About 6% of the time, the algorithm didn’t produce an output.)
Yet other techniques, namely the energy and Q factor of higher harmonics, were studied in a couple of papers out of NIST. In the latter work, they experimented with 20 cards, made up of 4 batches of 5 ‘identical’ cards each. The overall identification accuracy was 96%.
It seems safe to say that RFID fingerprinting techniques are still in their infancy, and there is much room for improvement by considering new categories of features, by combining different types of features, or by using different classification algorithms on the extracted features.
Privacy. RF fingerprinting, like other types of fingerprinting, shows a duality between security-enhancing and privacy-infringing applications, but in a less direct way. There are two types of RFID systems: “near-field” based on inductive coupling, used in contactless smartcards and the like, and “far field” based on backscatter, used in vehicle identification, inventory control, etc. The papers discussed so far pertain to near-field systems. There are no real privacy-infringing applications of near-field RF fingerprinting, because you can’t get close enough to extract a fingerprint without the owner of the tag knowing about it. Far-field systems, to which we will now turn, are ideally suited to high-tech stalking.
In a recent paper, the Zurich team mentioned earlier investigated the possibility of tracking people in a shopping mall based on strategically placed sensors, assuming that shoppers have several (far-field) RFID tags on them. The point is that it is possible to design chips that prevent tracking at the logical level by authenticating the reader, but this is impossible at the physical level.
Why would people have RFID tags on them? Tags used for inventory control in stores and not deactivated at the point of sale are one increasingly common possibility: they would end up in shopping bags (or even on clothes being worn, although that’s less likely). RFID tags in wallets and medical devices are another source; these are tags that the user wants to be present and functional.
What makes the tracking device the authors built powerful is that it is low-cost and can be operated surreptitiously at some distance from the victim: up to 2.75 meters, or 9 feet. They show that 5.4 bits of entropy can be extracted from a single tag, which means that 5 tags on a person gives 22 bits, easily enough to distinguish everyone who might be in a particular mall.
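As a rough back-of-the-envelope check on the "easily enough" claim, the sketch below takes the 22-bit figure quoted above at face value and asks how many pairs of people in a crowd would be expected to share a fingerprint. It assumes fingerprints are independent and uniformly distributed, which is an idealization and not something the paper claims; the crowd size of 1,000 and the function names are hypothetical.

```python
def distinct_fingerprints(bits: float) -> float:
    """Number of equally likely fingerprint values for a given entropy."""
    return 2.0 ** bits


def expected_colliding_pairs(people: int, bits: float) -> float:
    """Expected number of pairs of people who share a fingerprint.

    Assumes fingerprints are independent and uniform (an idealization).
    """
    pairs = people * (people - 1) / 2
    return pairs / distinct_fingerprints(bits)


# 22 bits is the figure quoted in the text; the crowd size is hypothetical.
crowd = 1_000
print(f"{distinct_fingerprints(22):,.0f} possible fingerprints")
print(f"{expected_colliding_pairs(crowd, 22):.3f} expected colliding pairs")
```

Under these assumptions a crowd of 1,000 yields roughly 0.12 expected colliding pairs, so most of the time nobody shares a fingerprint. The number grows with the square of the crowd size, so the margin shrinks quickly for much larger crowds.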
To assess the practical privacy risk, technological feasibility is only one dimension. We also need to ask who the adversary is and what the incentives are. Tracking people, especially shoppers, in physical space has the strongest incentive of all: selling products. While online tracking is pervasive, the majority of shopping dollars are still spent offline, and there’s still no good way to automatically identify people when they are in the vicinity in order to target offers to them. Facial recognition technology is highly error-prone and creeps people out, and that’s where RF fingerprinting comes in.
That said, RF fingerprinting is only one of the many ways of passively tracking people en masse in physical space — unintentional leaks of identifiers from smartphones and logical-layer identification of RFID tags seem more likely — but it’s probably the hardest to defend against. It is possible to disable RFID tags, but this is usually irreversible and it’s difficult to be sure you haven’t missed any. RFID jammers are another option but they are far from easy to use and are probably illegal in the U.S. One of the ETH Zurich researchers suggests tinfoil wrapping when going out shopping :-)
 Active RFID chips exist but most commercial systems use passive ones, and that’s what the fingerprinting research has focused on.
 They used a population of 50 tags, but this number is largely irrelevant since the experiment was one of binary classification rather than 1-out-of-n identification.
Thanks to Vincent Toubiana for comments on a draft.
By all accounts, Google has done a great job with Plus, both on privacy and on the closely related goal of better capturing real-life social nuances.  This article will summarize the privacy discussions I’ve had in the first few days of using the service and the news I’ve come across.
The origin of Circles
“Circles,” as you’re probably aware, is the big privacy-enhancing feature. A presentation titled “The Real-Life Social Network” by user-experience designer Paul Adams almost exactly a year ago went viral in the tech community; it looks likely this was the genesis, or at least a crystallization, of the Circles concept.
But Adams defected to Facebook a few months later, which led to speculation that it was the end of whatever plans Google may have had for the concept. Little did the world know at the time that Plus was a company-wide, bet-the-farm initiative involving 30 product teams and hundreds of engineers, and that the departure of one made no difference.
Meanwhile, Facebook introduced a friend-lists feature but it was DOA. When you’re staring at a giant list of several hundred “friends” — Facebook doesn’t do a good job of discouraging indiscriminate friending — categorizing them all is intimidating to say the least. My guess is that Facebook was merely playing the privacy communication game.
Why are circles effective?
I did an informal poll to see if people are taking advantage of Circles to organize their friend groups. Admittedly, I was looking at a tech-savvy, privacy-conscious group of users, but the response was overwhelming, and it was enough to convince me that Circles will be a success. There’s a lot of excitement among the early user community as they collectively figure out the technology as well as the norms and best practices for Circles. For example, this tip on how to copy a circle has been shared over 400 times as I write this.
One obvious explanation is that Circles captures real-life boundaries, and this is what users have been waiting for all along. That’s no doubt true, but I think there’s more to it than that. Multiple people have pointed out how the exemplary user interface for creating circles encouraged them to explore the feature. It is gratifying to see that Google has finally learned the importance of interface and interaction design in getting social right.
There are several other UI features that contribute to the success of Circles. When friending someone, you’re forced to pick one or more circles, instead of being allowed to drop them into a generic bucket and categorize them later. But in spite of this, the UI is so good that I find it no harder than friending on Facebook.
In addition, you have to pick circles to share each post with (but again the interface makes it really easy). Finally, each post has a little snippet that shows who can see it, which has the effect of constantly reminding you to mind the information flow. In short, it is nearly impossible to ignore the Circles paradigm.
The resharing bug
Google+ tries to balance privacy with Twitter-like resharing, which is always going to be tricky. Amusing inconsistencies result if you share a post with a circle that doesn’t include the original poster. A more serious issue, pointed out by many people including an FT blogger, is that “limited” posts can be publicly reshared. To their credit, Google engineers acknowledged it and quickly disabled the feature.
Meanwhile, some have opined that this issue is “totally bogus” and that this is how life works and how email works, in that when you tell someone a secret, they could share it with others. I strongly disagree, for two reasons.
First, this is not how the real world (or even email) works. Someone can repeat a secret you told them in real life, or forward an email, but they typically won’t broadcast it to the whole world. We’re talking about making something public here, something that will be forever associated with your real name and could very well come up in a web search.
Second, user-interface hints are an important and well-established way of nudging privacy-impacting behaviors. If there’s a ‘share’ button with a ‘public’ setting, many users will assume that it is OK to do just that. Twitter used to allow public retweets of protected tweets, and a study found that this had been done millions of times. In response, Twitter removed this ability. The privicons project seeks to embed similar hints in emails.
In other words, the privacy skeptics are missing the point: the goal of the feature is not to try to technologically prevent leakage of protected information, but to better communicate to users what’s OK to share and what isn’t. And in this case, the simplest way to do that is to remove the 1-click ability to share protected content publicly, and instead let users copy-paste if they really want to do that. It would also make sense to remind users to be careful when they’re sharing a limited post with their circles, which, I’m happy to see, is exactly what Google is doing.
A window into your circles
Paul Ohm points out that if someone shares content with a set of circles that includes you, you get to see 21 users who are part of those circles, apparently picked at random.  This means that if you look at these lists of 21 over time you can figure out a lot about someone’s circles, and possibly decipher them completely. Note that by default your profile shows a list of users in your circles, but not who’s in which circle, which for most people is significantly more sensitive.
In my view, this is an interesting finding, but not anything Google needs to fix; the feature is very useful (and arguably privacy-enhancing) and the information leakage is an inevitable tradeoff. But it’s definitely something that users would do well to be aware of: the secrecy of your circles is far from bulletproof.
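To get a feel for how quickly those lists of 21 could add up, here is a small simulation of the leak described above. It is only a sketch: it assumes each post shows a uniformly random sample of 21 members of a single circle, whereas the real selection may well not be uniform, and the function name and circle size are hypothetical.

```python
import random
from typing import Optional


def posts_until_circle_revealed(circle_size: int,
                                shown_per_post: int = 21,
                                rng: Optional[random.Random] = None) -> int:
    """Count the posts an observer must see before every member has appeared.

    Assumes each post reveals a uniformly random sample of the circle,
    which is a simplifying assumption about the real system.
    """
    rng = rng or random.Random()
    members = range(circle_size)
    sample_size = min(shown_per_post, circle_size)
    seen = set()
    posts = 0
    while len(seen) < circle_size:
        seen.update(rng.sample(members, sample_size))
        posts += 1
    return posts


# Average over repeated trials for a hypothetical circle of 100 people.
trials = [posts_until_circle_revealed(100, rng=random.Random(seed))
          for seed in range(200)]
print(sum(trials) / len(trials))
```

A circle with 21 or fewer members is exposed by a single post in this model; larger circles take more observations, in the manner of the coupon collector's problem.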
Speaking of which, the network visibility of different users on their profile page confused me terribly, until I realized Google+ is A/B testing that privacy setting! These are the two possibilities you could see when you edit your profile and click the circles area in the left sidebar: A, B. This is very interesting and unusual. At any rate, very few users seem to have changed the defaults so far, based on a random sample of a few dozen profiles.
Identity and distributed social networking
Some people are peeved that Google+ discourages you from participating pseudonymously. I don’t think a social network that wants to target the mainstream and wants to capture real-world relationships has any real choice about this. In fact, I want it to go further. Right now, Google+ often suggests I add someone I’ve already added, which turns out to be because I’ve corresponded with multiple email addresses belonging to that person. Such user confusion could be minimized if the system did some graph-mining to automatically figure out which identities belong to the same person. 
A related question is what this will mean for distributed social networking, which was hailed a year ago as the savior of privacy and user control. My guess is that Google+ will take the wind out of it: Google Takeout gives you a significant degree of control over your data. Further, due to the Apple-Twitter integration and the success of Android, the threat of Facebook monopolizing identities has been obliterated; there are at least three strong players now.
Another reason why Google+ competes with distributed social networks: for people worried about the social networking service provider (or the Government) reading their posts, client-side encryption on top of Google+ could work. The Circles feature is exactly what is needed to make encrypted posts viable, because you can make a circle of those who are using a compatible encryption/decryption plugin. At least a half-dozen such plugins have been created over the years (examples: 1, 2), but it doesn’t make much sense to use these over Facebook or Twitter. Once the Google+ developer API rolls out, I’m sure we’ll see yet another avatar of the encrypted status message idea, and perhaps the n-th time will be the charm.
Two years ago, I wrote that there’s a market case for a privacy-respecting social network to fill Livejournal’s shoes. Google+ seems poised to fulfill most of what I anticipated in that essay; the asymmetric nature of relationships and the ability to present different facets of one’s life to different people are two important characteristics that the two social networks have in common. 
Many have speculated on whether, and to what extent, Google+ is a threat to Facebook. One recurring comparison is Facebook as “ghetto” compared to Plus, such as in this image making the rounds on Reddit, reminiscent of Facebook vs. Myspace a few years ago. This perception of “coolness” and “class” is the single biggest thing Google+ has got going for it, more than any technological feature.
It’s funny how people see different things in Google+. While I’m planning to use Google+ as a Livejournal replacement for protected posts, since that’s what fits my needs, the majority of the commentary has compared it to Facebook. A few think it could replace Twitter, generalizing from their own corner of the Google+ network where people haven’t been using the privacy options. Forbes, being a business publication, thinks LinkedIn is the target. I’ve seen a couple of commenters saying they might use it instead of Yammer, another business tool. According to yet other articles, Flickr, Skype and various other Internet companies should be shaking in their boots. Have you heard the parable of the blind men and the elephant?
In short, Google+ is whatever you want it to be, and probably a better version of it. It’s remarkable that they’ve pulled this off without making it a confusing, bloated mess. Myspace founder Tom Anderson seems to have the most sensible view so far: Google+ is simply a better … Google, in that the company now has a smoother, more integrated set of services. You’d think people would have figured it out from the name!
 I will use the term “privacy” in this article to encompass both senses.
 It’s actually 22 users, including yourself and the poster. It’s not clear just how random the list is; in my perusal, mutual friends seem to be preferentially picked.
 I am not suggesting that Google+ should prevent users from having multiple accounts, although Circles makes it much less useful/necessary to have multiple accounts.
 On the other hand, when it comes to third party data collection, I do not believe that the market can fix itself.
Anonymization, once the silver bullet of privacy protection in consumer databases, has been shown to be fundamentally inadequate by the work of many computer scientists including myself. One of the best defenses is to control the distribution of the data: strong acceptable-use agreements including prohibition of deanonymization and limits on data retention.
These measures work well when outsourcing data to another company or a small set of entities. But what about scientific research and data mining contests involving personal data? Prizes are big and only getting bigger, and by their very nature involve wide data dissemination. Are legal restrictions meaningful or enforceable in this context?
I believe that having participants sign and fax a data-use agreement is much better from the privacy perspective than being able to download the data with a couple of clicks. However, I am sympathetic to the argument that I hear from contest organizers that every extra step will result in a big drop-off in the participation rate. Basic human psychology suggests that instant gratification is crucial.
That is a dilemma. But the more I think about it, the more I’m starting to feel that a two-step process could be a way to get the best of both worlds. Here’s how it would work.
For the first stage, the current minimally intrusive process is retained, but the contestants don’t get to download the full data. Instead, there are two possibilities.
- Release data on only a subset of users, minimizing the quantitative risk. 
- Release a synthetic dataset created to mimic the characteristics of the real data. 
For the second stage, there are various possibilities, not mutually exclusive:
- Require contestants to sign a data-use agreement.
- Restrict the contest to a shortlist of best performers from the first stage.
- Switch to an “online computation model” where participants upload code to the server (or make database queries over the network) and obtain results, rather than download data.
Overstock.com recently announced a contest that conformed to this structure: a synthetic data release followed by a semi-final and a final round in which selected contestants upload code to be evaluated against data. The reason for this structure appears to be partly privacy and partly the fact that they are trying to improve the performance of their live system, and performance needs to be judged in terms of impact on real users.
In the long run, I really hope that an online model will take root. The privacy benefits are significant: high-tech machinery like differential privacy works better in this setting. But even if such techniques are not employed, although there is the theoretical possibility of contestants extracting all the data by issuing malicious queries, the fact that queries are logged and might be audited should serve as a strong deterrent against such mischief.
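To make the online model a little more concrete, here is a toy sketch of a server that answers counting queries, records every query in an audit log, and adds Laplace noise in the style of differential privacy. Everything about it is an illustrative assumption: the class name, the epsilon value, the per-contestant query limit and the record format are hypothetical and do not describe any real contest platform.

```python
import random
import time


class LoggedQueryServer:
    """Toy contest server: noisy counts plus an audit log (hypothetical design)."""

    def __init__(self, records, epsilon=0.5, max_queries=100, rng=None):
        self._records = list(records)
        self._epsilon = epsilon          # privacy parameter per query (assumed)
        self._max_queries = max_queries  # per-contestant limit (assumed)
        self._rng = rng or random.Random()
        self.audit_log = []              # (timestamp, contestant, description)

    def _laplace(self, scale):
        # The difference of two i.i.d. exponentials is Laplace-distributed.
        rate = 1.0 / scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)

    def count(self, contestant, description, predicate):
        """Log the query, then return a noisy count of matching records."""
        used = sum(1 for _, who, _ in self.audit_log if who == contestant)
        if used >= self._max_queries:
            raise PermissionError(f"{contestant} exceeded the query limit")
        self.audit_log.append((time.time(), contestant, description))
        true_count = sum(1 for record in self._records if predicate(record))
        # A counting query has sensitivity 1, so the noise scale is 1/epsilon.
        return true_count + self._laplace(1.0 / self._epsilon)


# Hypothetical usage with made-up records.
server = LoggedQueryServer([{"age": a} for a in range(100)],
                           rng=random.Random(0))
noisy = server.count("team_a", "age >= 50", lambda r: r["age"] >= 50)
print(round(noisy, 1), len(server.audit_log))
```

Even with the noise removed, the log and the query limit alone capture the deterrent described above: a contestant trying to extract the data wholesale would leave a trail of queries that an auditor could inspect.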
The advantages of the online model go beyond privacy. For example, I served on the Heritage Health Prize advisory board, and we discussed mandating a limit on the amount of computation that contestants were allowed. The motivation was to rule out algorithms that needed so much hardware firepower that they couldn’t be deployed in practice, but the stipulation had to be rejected as unenforceable. In an online model, enforcement would not be a problem. Another potential benefit is the possibility of collaboration between contestants at the code level, almost like an open-source project.
 Obtaining informed consent from the subset whose data is made publicly available would essentially eliminate the privacy risk, but the caveat is the possibility of selection bias.
 Creating a synthetic dataset from a real one without leaking individual data points and at the same time retaining the essential characteristics of the data is a serious technical challenge, and whether or not it is feasible will depend on the nature of the specific dataset.
I saw a tweet today that gave me a lot to think about:
A rather intricate example of social adaptation to technology. If I understand correctly, the cousins in question are taking advantage of the fact that liking someone’s status/post on Facebook generates a notification for the poster that remains even if the post is immediately unliked. 
What’s humbling is that such minor features have the power to affect so many, and so profoundly. What’s scary is that the feature is so fickle. If Facebook starts making updates available through a real-time API, like Google Buzz does, then the ‘like’ will stick around forever on some external site and users will be none the wiser until something goes wrong. Similar things have happened: a woman was fired because sensitive information she put on Twitter and then deleted was cached by an external site. I’ve written about the privacy dangers of making public data “more public”, including the problems of real-time APIs. 
As complex and fascinating as the technical issues are, the moral challenges interest me more. We’re at a unique time in history in terms of technologists having so much direct power. There’s just something about the picture of an engineer in Silicon Valley pushing a feature live at the end of a week, and then heading out for some beer, while people halfway around the world wake up and start using the feature and trusting their lives to it. It gives you pause.
This isn’t just about privacy or just about people in oppressed countries. RescueTime estimates that 5.3 million hours were spent worldwide on Google’s Les Paul doodle feature. Was that a net social good? Who is making the call? Google has an insanely rigorous A/B testing process to optimize between 41 shades of blue, but do they have any kind of process in place to decide whether to release a feature that 5.3 million hours—eight lifetimes—are spent on?
For the first time in history, the impact of technology is being felt worldwide and at Internet speed. The magic of automation and ‘scale’ dramatically magnifies effort and thus bestows great power upon developers, but it also comes with the burden of social responsibility. Technologists have always been able to rely on someone else to make the moral decisions. But not anymore—there is no ‘chain of command,’ and the law is far too slow to have anything to say most of the time. Inevitably, engineers have to learn to incorporate social costs and benefits into the decision-making process.
Many people have been raising awareness of this—danah boyd often talks about how tech products make a mess of many things: privacy for one, but social nuances in general. And recently at TEDxSiliconValley, Damon Horowitz argued that technologists need a moral code.
But here’s the thing—and this is probably going to infuriate some of you—I fear that these appeals are falling on deaf ears. Hackers build things because it’s fun; we see ourselves as twiddling bits on our computers, and generally don’t even contemplate, let alone internalize, the far-away consequences of our actions. Privacy is viewed in oversimplified access-control terms and there isn’t even a vocabulary for a lot of the nuances that users expect.
The ignorant are at least teachable, but I often hear a willful disdain for moral issues. Anything that’s technically feasible is seen as fair game and those who raise objections are seen as incompetent outsiders trying to rain on the parade of techno-utopia. The pronouncements of executives like Schmidt and Zuckerberg, not to mention the writings of people like Arrington and Scoble who in many ways define the Valley culture, reflect a tone-deaf thinking and a we-make-the-rules-get-over-it attitude.
Something’s gotta give.
 It’s possible that the poster is talking about Twitter, and by ‘like’ they mean ‘favorite’. This makes no difference to the rest of my arguments; if anything it’s stronger because Twitter already has a Firehose.
 Potential bugs are another reason that this feature is fickle. As techies might recognize, ensuring that a like doesn’t show up after an item is unliked maps to the problem of update propagation in a distributed database, which the CAP theorem proves is hard. Indeed, Facebook often has glitches of exactly this sort—you might notice it because a comment notification shows up and the comment doesn’t, or vice versa, or different people see different like counts, etc.
[ETA] I see this essay as somewhat complementary to my last one on how information technology enables us to be more private contrasted with the ways in which it also enables us to publicize our lives. There I talked about the role of consumers of technology in determining its direction; this article is about the role of the creators.
[Edit 2] Changed the British spelling ‘wilful’ to American.
Thanks to Jonathan Mayer for comments on a draft.
There are many, many things that digital technology allows us to do more privately today than we ever could. Consider:
The ability of marginalized or oppressed individuals to leverage the privacy of online communication tools to unite in support of a cause, or simply to find each other, has been earth-shattering.
- It has played a key role in the ongoing Middle East uprisings. The Internet helps primarily by enabling rapid communication and coordination, but being able to do it covertly—clumsy governmental hacking attempts notwithstanding—is an equally important aspect.
- Clay Shirky tells the story of how some of meetup.com’s most popular groups were (ir)religious communities that don’t find support in broader U.S. culture: Pagans, ex-Jehovah’s Witnesses, atheists, etc.
- STD-positive individuals can use online dating sites targeted at their group. Can you imagine the Sisyphean frustration of trying to date offline and find a compatible partner if you have an STD?
In the political realm, the anonymity afforded by Wikileaks is leading to a challenge to the legitimacy of high-level government actors, if not entire governments. Bitcoin is another anonymity technology that shows the potential to have serious political effects. 
Most of us benefit at an everyday level from improved privacy. When we read, search, or buy online, people around us don’t find out about it. This is vastly more private than checking out a book from a library or buying something at a store. 
We’ve benefited not only in our mundane activities, but in our kinky ones as well. We take and exchange naked pictures all the time, something we could never do back when it involved getting film developed at a store. And slightly over half of us have taken advantage of the fact that “hiding one’s porn” is trivial today compared to the bad old days of magazines.
I could go on (I haven’t even mentioned the uses of Tor or encryption, freely available to anyone willing to invest a little effort), but I’ve made my point. Of course, I’ve only presented one half of the story. The other half, that technology is also allowing us to expose ourselves in ways never before possible, has been told so many times by so many people, and so loudly, that it is drowning out meaningful conversation about privacy.
Having presented the above evidence, I posit that technology by itself is actually largely neutral with respect to privacy, in that it enhances the privacy of some types of actions and encumbers that of others. Which direction society takes is up to us. In other words, I’m asserting the negation of technological determinism, applied to privacy.
While I do believe that privacy-infringing technologies have been adopted more pervasively than privacy-enhancing ones, I would say that the disparity is far smaller than it is generally thought to be. Why the mismatch in perception? A curious collective cognitive bias. Observe that almost every one of the examples above is generally seen as a new kind of activity enabled by technology whereas they are really examples of technology allowing us to do a familiar activity, but with more privacy (among other benefits).
Another reason for the cognitive bias is our tendency to focus on the dangers and the negatives of technology. Let’s go back to the nude pictures example: just about everyone does it, but only a small number (perhaps 1%?) suffer some harm from it. Like Schneier says, if it’s in the news, don’t worry about it.
To the extent that privacy-infringing technologies have been more successful, it’s a choice we’ve collectively made. Demand for social networking has been so strong that the sector has somehow invented a halfway workable business model, even though it took several tries to get there. But demand for encryption has been so weak that the market never matured enough to make it usable to the general public.
The disparity could be because we don’t know what’s good for us (volumes have been written about this), but it could also be partly because there are costs and benefits to giving up our privacy, and the benefits, in proportion to the costs, are rather higher than they are generally made out to be.
Those are all questions worth pondering, but I hope I have convinced you of this: the idea that information technology inherently invades privacy is oversimplified and misleading. If we’re giving up privacy, we have only ourselves to blame.
 Many privacy-enhancing technologies are morally ambiguous. I’m merely listing the ways in which people benefit from privacy, regardless of whether they’re using it for good or evil.
 It is probably true that the Internet has made it easier for government, advertisers etc. to track your activities. But it doesn’t change the fact that there’s a privacy benefit to regular people in an everyday context, who are far more concerned about keeping secrets from their family, friends and neighbors than about abstract threats.
[ETA] This essay examines the role of consumers in shaping the direction of technology, whereas the next one looks at the role of creators.
Thanks to Ann Kilzer for comments on a draft.
I have a new paper titled “You Might Also Like:” Privacy Risks of Collaborative Filtering with Joe Calandrino, Ann Kilzer, Ed Felten and Vitaly Shmatikov. We developed new “statistical inference” techniques and used them to show how the public outputs of online recommender systems, such as the “You Might Also Like” lists you see on many websites, can reveal individual purchases and preferences. Joe spoke about it at the IEEE S&P conference at Oakland earlier today.
Background: inference and statistical inference. The paper is about techniques for inference. At its core, inference is a simple concept: deducing that some event has occurred based on its effect on other observable events or objects, often seemingly unrelated. Think Sherlock Holmes, whether something simple such as the idea of a smoking gun, now so well known that it’s a cliché, or something more subtle like the curious incident of the dog in the night time.
Today, inference has evolved a great deal, and in our data-rich world, inference often means statistical inference. Detection of extrasolar planets is a good example of making deductions from the faintest clues: A planet orbiting a star makes the star wobble slightly, which affects the velocity of the star with respect to the Earth. And this relative velocity can be deduced from the displacement in the parent star’s spectral lines due to the Doppler effect, thus inferring the existence of a planet. Crazy!
Web privacy. But back to the paper: what we did was to develop and apply inference techniques in the web context, specifically recommender systems, in a way that no one had thought of before. As you may have noticed, just about every website publicly shows relationships between related items (products, videos, books, news articles, etc.), and these relationships are derived from purchases or views, which are private information. What if the public listings could be reverse engineered, so that we can infer a user’s purchases from them? As the abstract says:
Many commercial websites use recommender systems to help customers locate products and content. Modern recommenders are based on collaborative filtering: they use patterns learned from users’ behavior to make recommendations, usually in the form of related-items lists. The scale and complexity of these systems, along with the fact that their outputs reveal only relationships between items (as opposed to information about users), may suggest that they pose no meaningful privacy risk.
In this paper, we develop algorithms which take a moderate amount of auxiliary information about a customer and infer this customer’s transactions from temporal changes in the public outputs of a recommender system. Our inference attacks are passive and can be carried out by any Internet user. We evaluate their feasibility using public data from popular websites Hunch, Last.fm, LibraryThing, and Amazon.
Consider a user Alice who’s made numerous purchases, some of which she has reviewed publicly. Now she makes a new purchase which she considers sensitive. But this new item, because of her purchasing it, has a nonzero probability of entering the “related items” list of each of the items she has purchased in the past, including the ones she has reviewed publicly. And even if it is already in the related-items list of some of those items, it might improve its rank on those lists because of her purchase. By aggregating dozens or hundreds of these observations, the attacker has a chance of inferring that Alice purchased something, as well as the identity of the item she purchased.
It’s a subtle technique, and the paper has more details than you can shake a stick at if you want to know more.
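To make the Alice example concrete, here is a minimal sketch of the aggregation step. It is not the algorithm from the paper; it is a toy version under simple assumptions: the attacker has two snapshots of the related-items lists for the items Alice is publicly known to own, and counts how many of those lists each other item newly entered or moved up in. The function names, the item names and the threshold are all hypothetical.

```python
from collections import Counter

def score_candidates(before, after, aux_items):
    """For each candidate item, count how many auxiliary items' related-items
    lists it newly entered, or moved up in, between the two snapshots."""
    scores = Counter()
    for aux in aux_items:
        old_rank = {item: i for i, item in enumerate(before.get(aux, []))}
        for rank, item in enumerate(after.get(aux, [])):
            if item in aux_items:
                continue  # already known to belong to the target
            if item not in old_rank or rank < old_rank[item]:
                scores[item] += 1
    return scores

def infer_purchases(before, after, aux_items, threshold):
    """Return candidates whose score reaches the (hypothetical) threshold."""
    scores = score_candidates(before, after, aux_items)
    return sorted(item for item, s in scores.items() if s >= threshold)

# Toy snapshots of related-items lists (made-up data, best match first).
aux_items = {"book_a", "book_b", "book_c"}
before = {
    "book_a": ["x", "y", "z"],
    "book_b": ["y", "x", "w"],
    "book_c": ["w", "z", "y"],
}
after = {
    "book_a": ["x", "secret", "y"],
    "book_b": ["secret", "y", "x"],
    "book_c": ["z", "secret", "w"],
}

scores = score_candidates(before, after, aux_items)
inferred = infer_purchases(before, after, aux_items, threshold=2)
print(scores, inferred)
```

In this toy data one unrelated item also moves up in a single list, which is the kind of noise a real site produces constantly; requiring agreement across several lists is what filters it out.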
We evaluated the attacks we developed against several websites of a diverse nature. Numerically, our best results are against Hunch, a recommendation and personalization website. There is a tradeoff between the number of inferences and their accuracy. When optimized for accuracy, our algorithm inferred a third of the test users’ secret answers to Hunch questions with no error. Conversely, if asked to predict the secret answer to every secret question, the algorithm had an accuracy of around 80%.
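The tradeoff between the number of inferences and their accuracy can be pictured as a confidence threshold. The sketch below is only an illustration: the predictions, the confidence values and the `evaluate` helper are made up for this example and are not data or code from the paper.

```python
def evaluate(predictions, truth, min_confidence):
    """Keep only predictions at or above min_confidence and report
    (number of inferences made, fraction of those that are correct)."""
    kept = [(q, a) for q, a, conf in predictions if conf >= min_confidence]
    if not kept:
        return 0, None
    correct = sum(1 for q, a in kept if truth[q] == a)
    return len(kept), correct / len(kept)

# Made-up toy data: (question, predicted answer, confidence).
predictions = [
    ("q1", "yes", 0.95),
    ("q2", "no", 0.90),
    ("q3", "yes", 0.60),
    ("q4", "no", 0.55),
    ("q5", "no", 0.30),
]
truth = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "yes", "q5": "yes"}

strict = evaluate(predictions, truth, min_confidence=0.9)  # few, reliable inferences
greedy = evaluate(predictions, truth, min_confidence=0.0)  # answer everything
print(strict, greedy)
```

Raising the threshold yields fewer but more reliable inferences; lowering it to zero forces a guess on every question at a lower overall accuracy.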
Impact. It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different. On sites other than Hunch (which had an API that provided exact numerical correlations between pairs of items), our attacks worked only on a small proportion of users, although that is sufficient to demonstrate the concept. (Hunch has since eliminated this feature of the API, for reasons unrelated to our research.) We also found that users of larger sites are much safer, because the statistical aggregates are computed from a larger set of users.
But here’s why we think this paper is important:
- Our attack applies to a wide variety of sites—essentially every site with an online catalog of some sort. While we discuss various ways to mitigate the attack in the paper, there is no bulletproof “fix.”
- It undermines the widely accepted dichotomy between “personally identifiable” individual records and “safe,” large-scale, aggregate statistics. Furthermore, it demonstrates that the dynamics of aggregate outputs (i.e., their variation with time) constitute a new vector for privacy breaches. Dynamic behavior of high-dimensional aggregates like item similarity lists falls beyond the protections offered by any existing privacy technology, including differential privacy.
- It underscores the fact that modern systems have vast “surfaces” for attacks on privacy, making it difficult to protect fine-grained information about their users. Unintentional leaks of private information are akin to side-channel attacks: it is very hard to enumerate all aspects of the system’s publicly observable behavior which may reveal information about individual users.
That last point is especially interesting to me. We’re leaving digital breadcrumbs online all the time, whether we like it or not. And while algorithms to piece these trails together might seem sophisticated today, they will probably look mundane in a decade or two if history is any indication. The conversation around privacy has always centered around the assumption that we can build technological tools to give users—at least informed users—control over what they reveal about themselves, but our work suggests that there might be fundamental limits to those tools.
See also: Joe Calandrino’s post about this paper.
I had a fun and engaging discussion on the “Paying With Data” panel at the South by Southwest conference; many thanks to my co-panelists Sara Marie Watson, Julia Angwin and Sam Yagan. I’d like to elaborate here on a concept that I briefly touched upon during the panel.
The market for lemons
In a groundbreaking paper 40 years ago, economist George Akerlof explained why so many used cars are lemons. The key is “asymmetric information:” the seller of a car knows more about its condition than the buyer does. This leads to “adverse selection” and a negative feedback spiral, with buyers tending to assume that there are hidden problems with cars on the market, which brings down prices and disincentivizes owners of good cars from trying to sell, further reinforcing the perception of bad quality.
In general, a market with asymmetric information is in danger of developing these characteristics: (1) buyers/consumers lack the ability to distinguish between high and low quality products, (2) sellers/service providers lose the incentive to focus on quality, and (3) the bad gradually crowds out the good, since poor-quality products are cheaper to produce.
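The feedback spiral is easy to reproduce in a toy simulation. The sketch below uses a common textbook simplification, not anything taken from Akerlof’s paper directly: each car has a true quality known only to its owner, an owner sells only if the price is at least that quality, and buyers, unable to tell cars apart, pay 1.5 times the average quality of the cars currently on offer. The numbers and the function name are assumptions chosen for illustration.

```python
def simulate_lemons(qualities, buyer_premium=1.5, max_rounds=50):
    """Iterate the adverse-selection spiral.

    Returns a list of (price, number_of_cars_offered) per round.
    """
    offered = list(qualities)  # initially every owner is willing to sell
    history = []
    for _ in range(max_rounds):
        if not offered:
            break
        # Buyers cannot observe quality, so they pay for the average car.
        price = buyer_premium * sum(offered) / len(offered)
        history.append((price, len(offered)))
        # Owners of cars worth more than the price withdraw from the market.
        remaining = [q for q in offered if q <= price]
        if len(remaining) == len(offered):
            break  # nobody else leaves: the market has settled
        offered = remaining
    return history

# 100 cars with true qualities of 100, 200, ..., 10000 (arbitrary units).
history = simulate_lemons(range(100, 10001, 100))
for price, count in history:
    print(f"price {price:8.1f}  cars offered {count}")
```

Even though every buyer values every car more than its owner does, each round drives the best remaining cars out, and the market shrinks to a handful of the worst ones.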
Information security and privacy suffer from this problem at least as much as used cars do.
The market for security products and certification
Bruce Schneier describes how various security products, such as USB drives, have turned into a lemon market. And in a fascinating paper, Ben Edelman analyzes data from TRUSTe certifications and comes to some startling conclusions [emphasis mine]:
Widely-used online “trust” authorities issue certifications without substantial verification of recipients’ actual trustworthiness. This lax approach gives rise to adverse selection: The sites that seek and obtain trust certifications are actually less trustworthy than others. Using a new dataset on web site safety, I demonstrate that sites certified by the best-known authority, TRUSTe, are more than twice as likely to be untrustworthy as uncertified sites. This difference remains statistically and economically significant when restricted to “complex” commercial sites.
TRUSTe’s “Watchdog Reports” also indicate a lack of focus on enforcement. TRUSTe’s postings reveal that users continue to submit hundreds of complaints each month. But of the 3,416 complaints received since January 2003, TRUSTe concluded that not a single one required any change to any member’s operations, privacy statement, or privacy practices, nor did any complaint require any revocation or on-site audit. Other aspects of TRUSTe’s watchdog system also indicate a lack of diligence.
The market for personal data
In the realm of online privacy and data collection, the information asymmetry results from a serious lack of transparency around privacy policies. The website or service provider knows what happens to data that’s collected, but the user generally doesn’t. This arises due to several economic, architectural, cognitive and regulatory limitations/flaws:
- Each click is a transaction. As a user browses around the web, she interacts with dozens of websites and performs hundreds of actions per day. It is impossible to make privacy decisions with every click, or have a meaningful business relationship with each website, and hold them accountable for their data collection practices.
- Technology is hard to understand. Companies can often get away with meaningless privacy guarantees such as “anonymization” as a magic bullet, or “military-grade security,” a nonsensical term. The complexity of private browsing mode has led to user confusion and a false sense of safety.
- Privacy policies are filled with legalese and no one reads them, which means that disclosures made therein count for nothing. Yet, courts have upheld them as enforceable, disincentivizing websites from finding ways to communicate more clearly.
Collectively, these flaws have led to a well-documented market failure—there’s an arms race to use all means possible to entice users to give up more information, as well as to collect it passively through ever-more intrusive means. Self-regulatory organizations become captured by those they are supposed to regulate, and therefore their effectiveness quickly evaporates.
TRUSTe seems to be up to some shenanigans in the online tracking space as well. As many have pointed out, the TRUSTe “Tracking Protection List” for Internet Explorer is in fact a whitelist, allowing about 4,000 domains (almost certainly from companies that have paid TRUSTe) to track the user. Worse, installing the TRUSTe list seems to override the blocking of a domain via another list!
The obvious response to a market with asymmetric information is to correct the information asymmetry—for used cars, it involves taking it to a mechanic, and for online privacy, it is consumer education. Indeed, the What They Know series has done just that, and has been a big reason why we’re having this conversation today.
However, I am skeptical that the market can be fixed through consumer awareness alone. Many of the factors I’ve laid out above involve fundamental cognitive limitations, and while consumers may be well-educated about the general dangers prevalent online, it does not necessarily help them make fine-grained decisions.
It is for these reasons that some sort of government regulation of the online data-gathering ecosystem seems necessary. Regulatory capture is of course still a threat, but less so than with self-regulation. Jonathan Mayer and I point out in our FTC Comment that ad industry self-regulation of online tracking has been a failure, and argue that the FTC must step in and enforce Do Not Track.
In summary, information asymmetry occurs in many markets related to security and privacy, leading in most cases to a spiraling decline in quality of products and services from a consumer perspective. Before we can talk about solutions, we must clearly understand why the market won’t fix itself, and in this post I have shown why that’s the case.
Update. TRUSTe president Fran Maier responds in the comments.
Thanks to Jonathan Mayer for helpful feedback.
Privacy norms, rules and expectations in the real world go far beyond the “public/private” dichotomy. Yet in the realm of web crawler access control, we are tied to this binary model via the robots.txt allow/deny rules. This position paper describes some of the resulting problems and argues that it is time for a more sophisticated standard.
The problem: privacy of public data. The first author has argued that individuals often expect privacy constraints on data that is publicly accessible on the web. Some examples of such constraints relevant to the web-crawler context are:
- Data should not be archived beyond a certain period (or at all).
- Crawling a small number of pages is allowed, but large-scale aggregation is not.
- “Linkage” of personal information to other databases is prohibited.
Currently there is no way to specify such restrictions in a machine-readable form. As a result, sites resort to hacks such as identifying and blocking crawlers whose behavior they don’t like, without clearly defining acceptable behavior. Other sites specify restrictions in the Terms of Service and bring legal action against violators. This is clearly not a viable solution: for operators of web-scale crawlers, manually interpreting and encoding the ToS restrictions of every site is prohibitively expensive.
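To see how little the current model can express, here is a small sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the bot name and the URLs are made up for the example. The only question a compliant crawler can ask is whether a given URL may be fetched; there is nowhere to say how long the data may be kept or what it may be combined with.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for a hypothetical site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The entire vocabulary is a yes/no answer per user agent and URL.
allowed_profile = parser.can_fetch("ExampleBot", "https://example.com/users/alice")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/notes")
print(allowed_profile, allowed_private)
```

A public profile page is simply “allowed,” with no way to attach an archival limit, an aggregation limit or a linkage restriction to that answer.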
There are two reasons why the problem has become pressing: first, there is an ever-increasing quantity of behavioral data about users that is valuable to marketers — in fact, there is even a black market for this data — and second, crawlers have become very cheap to set up and operate.
The desire for control over web content is by no means limited to user privacy concerns. Publishers concerned about copyright are equally in search of a better mechanism for specifying fine-grained restrictions on the collection, storage and dissemination of web content. Many site owners would also like to limit the acceptable uses of data for competitive reasons.
The solution space. Broadly, there are three levels at which access/usage rules may be specified: site-level, page-level and DOM element-level. Robots.txt is an example of a site-level mechanism, and one possible solution is to extend robots.txt. A disadvantage of this approach, however, is that the file may grow too large, especially on sites with user-generated content that may wish to specify per-user policies.
A page-level mechanism thus sounds much more suitable. While there is already a “robots” attribute to the META tag, it is part of the robots.txt specification and has the same limitations on functionality. A different META tag is probably an ideal place for a new standard.
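As a rough illustration of what a page-level mechanism might look like to a crawler, the sketch below reads a policy from a META tag using Python’s standard `html.parser`. Everything about the tag is hypothetical: the name `crawl-policy` and the directives `max-archive-days`, `no-aggregation` and `no-linkage` are invented here to mirror the example constraints listed earlier, and no such standard exists.

```python
from html.parser import HTMLParser

class PolicyMetaParser(HTMLParser):
    """Collect directives from a hypothetical <meta name="crawl-policy"> tag."""

    def __init__(self):
        super().__init__()
        self.policy = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "crawl-policy":
            return
        # Directives are comma-separated; "key=value" or a bare flag.
        for part in (attr.get("content") or "").split(","):
            part = part.strip()
            if not part:
                continue
            key, _, value = part.partition("=")
            self.policy[key.strip()] = value.strip() or True

# A made-up page carrying both the existing robots tag and the hypothetical one.
PAGE = (
    '<html><head>'
    '<meta name="robots" content="noindex">'
    '<meta name="crawl-policy" '
    'content="max-archive-days=30, no-aggregation, no-linkage">'
    '</head><body>profile page</body></html>'
)

meta_parser = PolicyMetaParser()
meta_parser.feed(PAGE)
policy = meta_parser.policy
print(policy)
```

Parsing is the easy part; the hard part, as discussed in the conclusion, is agreeing on precise semantics for each directive.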
Taking it one step further, tagging at the DOM element-level using microformats to delineate personal information has also been proposed. A possible disadvantage of this approach is the overhead of parsing pages that crawlers will have to incur in order to be compliant.
Conclusion. While the need to move beyond the current robots.txt model is apparent, it is not yet clear what should replace it. The challenge in developing a new standard lies in accommodating the diverse requirements of website operators and precisely defining the semantics of each type of constraint without making it too cumbersome to write a compliant crawler. In parallel with this effort, the development of legal doctrine under which the standard is more easily enforceable is likely to prove invaluable.
This article starts from the example of a simple privacy mishap and argues that the flawed thinking it exposes is a symptom of a deeper malaise and that the structure of privacy research in computer science might require rethinking.
I was surprised by a statement in a recent blog post by Geni, a genealogy-based social networking site, that plainly asserted, “following does not have any privacy implications.” This was in reference to the feature to “follow” a user or profile on the site, which among other things notifies you instantly of new information or activity about the person. (Admirably, however, Geni listened to their users and made some changes to the feature.)
Of course following has privacy implications. Without the follow feature — not just on Geni but on virtually every site that provides an equivalent capability — to obtain the same level of up-to-date information about a person, you’d have to either sit around constantly refreshing their profile or else write a bot that will do that for you and notify you of any updates by email. It is precisely because of this vast difference in the ease of keeping track of people that there was a backlash when Facebook introduced News Feed several years ago.
Why then would anyone claim that following has no privacy implications? The culprit here is “adversarial thinking,” an analytical process that computer scientists and security engineers are trained in. Under this paradigm, users are viewed as all-powerful “adversaries” (limited only by the fundamental computational limits of nature), typically interested in learning as much information about everyone as possible. Clearly, if everyone is an “adversary,” the follow feature makes not a whit of difference, since anyone could create and operate the bot mentioned above with no effort at all.
Weird as it may seem to the uninitiated, adversarial thinking is second nature to computer scientists. It is adversarial thinking that leads to the formulation of privacy as an access-control problem, something that I’ve criticized; the Geni blog post explicitly mentions this as their formulation of privacy. Privacy-as-access-control makes for neat papers but tends to break down quickly in the real world.
Let me be clear: adversarial thinking is a deep and valuable skill that is indispensable in the context that it is meant for — designing cryptosystems. However, it is not always the right paradigm in the privacy context. The theoretical study of database privacy seems to be doing rather well by borrowing methods from cryptography, and I’ve argued in support of adversarial thinking therein. On the other hand, social networking privacy falls squarely in the class of studies in which I find the adversarial approach to have limited value.
There’s a bigger take-away here: the structure of privacy research within computer science might require rethinking. Privacy is currently not considered a first-rate topic but is instead a side-interest of different communities such as security, cryptography and databases/datamining. As a result of this lack of primacy, not only do we frequently use the wrong methods — when all you’ve got is a hammer, everything looks like a nail — we’re also missing out on the chance to borrow from the literature on privacy in fields like law, economics, sociology, and human-computer interaction.
 This is not the only reason why the follow feature has privacy implications. On Livejournal, being followed by people with offensive usernames is sometimes a problem, compounded by the fact that due to the UI, it is not obvious who is following whom. In fact, the privacy changes made by Geni seem intended to address roughly this type of concern rather than the ease-of-tracking issue.
 While the term adversary is standard, adversarial thinking is a term I’ve coined here to describe a somewhat loose collection of axioms (including, for example, Kerckhoffs’s principle) that constitute the dominant paradigm of cryptography/security. I don’t think there is an extant term; I’d love to be corrected.
Thanks to Aleksandra Korolova for comments on a draft.