Data Privacy: The Story of a Paradigm Shift
Let’s take a break from the Ubercookies series. I’m at the IPAM data privacy workshop in LA, and I want to tell you about the kind of unusual scientific endeavor that it represents. I’ve recently started to write about the process of doing science, what’s good and what’s bad about it, and I expect to have more to say on this topic in this blog.
While “paradigm shift” has become a buzzword, the original sense in which Kuhn used it refers to a specific scientific process. I’ve had the rare experience of witnessing such a paradigm shift unfold, and I may even have played a small part. I am going to tell that story. I hope it will give you a “behind-the-scenes” look into how science works.
I will sidestep the question of whether data privacy is a science. I think it is a science to the extent that computer science is a science. At any rate, I think this narrative provides a nice illustration of Kuhn’s ideas.
First I need to spend some time setting up the scene and the actors. (I’m going to take some liberties and simplify things for the benefit of the broader audience, and I hope my colleagues will forgive me for it.)
The scene. Privacy research is incredibly multidisciplinary, and this workshop represents one extreme of the spectrum: the math behind data privacy. The mathematical study of privacy in databases centers on one question:
If you have a bunch of data collected from individuals, and you want to let other people do something useful with the data, such as learning correlations, how do you do it without revealing individual information?
There are roughly 3 groups that investigate this question and are represented here:
- computer scientists with a background in cryptography / theoretical CS
- computer scientists with a background in databases and data mining
- statisticians.
This classification is neither exhaustive nor strict, but it will suffice for my current purposes.
One of the problems with science and math research is that different communities studying different aspects of the the the same problem (or even studying the same problem from different perspectives) don’t meet together very often. For one, there is a good deal of friction in overcoming the language barriers (different names/ways of thinking about the same things). For another, academics are rewarded primarily for publishing in their own communities. That is why the organizers deserve a ton of credit for bridging the barriers and getting people together.
The paradigms. There is a fundamental, inescapable tension between the utility of data and the privacy of the participants. That’s the one thing that theorists and practitioners can agree on
Given that fact, there are two approaches to go about building a theory of privacy-protection, which I will call utility-first and privacy-first. Statisticians and database people tend to prefer the former paradigm, and cryptographers the latter; but this is not a clean division.
Utility-first hopes to be able to preserve the statistical computations that we would want to do if we didn’t have to worry about privacy, and then ask, “how can we improve the privacy of participants while still doing all these things?” Data anonymization is one natural technique that comes out of this world view: if you are only doing simple syntactic transformations to the data, the utility of the data is not affected very much.
On the other hand, privacy-first says, “let’s first figure out a rigorously provable way to assure the privacy of participants, and then go about figuring out what are the types of computations that can be carried out under this rubric.” The community has collectively decided, with good reason, that differential privacy is the right rubric to use. To explain it properly would require many Greek symbols, so I won’t.
Privacy-first and utility-first are scientific paradigms, not theories. Neither is falsifiable. We can say that one is better, but that is a judgement.
An important caveat must be noted here. The terms do not refer to the social values of putting the utility of the data before the privacy of the participants, or vice versa. Those values are external to the model and are constraints enforced by reality. Instead, we are merely talking about which paradigm gives us better analytical techniques to achieve both the utility and privacy requirements to the extent possible.
The shift. With utility-first, you have strong, well-understood guarantees on the usefulness of the data, but typically only a heuristic analysis of privacy. What this translates to is an upper bound on privacy. With privacy-first, you have strong, well-understood privacy guarantees, but you only know how to perform certain types of computations on the data. So you have a lower bound on utility.
That’s where things get interesting. Utility-first starts to look worse as time goes on, as we discover more and more inferential techniques for breaching the privacy of participants. Privacy-first starts to look better with time, as we discover that more and more types of data-mining can be carried out due to innovative algorithms. And that is exactly how things have played out over the last few years.
I was at a similarly themed workshop at Bertinoro, Italy back in 2005, with much the same audience in attendance. Back then, the two views were about equally prevalent; the first papers on differential privacy were being written or had just been written (of course, the paradigm itself was not new). Fast forward 5 years, and the proponents of one view have started to win over the other, although we quibble to no small extent over the details. Overall, though, the shift has happened in a swift and amicable way, with both sides now largely agreeing on differential privacy.
Why did privacy-first win? I can see many reasons. The privacy protections of the utility-first techniques kept getting broken (a Kuhnian “crisis”?); the de-anonymization research that I and others worked on played a big part here. Another reason might be the way the cryptographic community operates: once they decide that a paradigm is worth investigating, they tend to jump in on it all at once and pick the bones clean. That ensured that within a few years, a huge number of results of the form “how to compute X with differential privacy” were published. A third reason might very well be the fact that these interdisciplinary workshops exist, giving us an opportunity to change each other’s minds.
The fallout. While the debate in theoretical circles seems largely over, the ripple effects are going to be felt “downstream” for a long time to come. Differential privacy is only slowly penetrating other areas of research where privacy is a peripheral but not a fundamental object of study. As for law and policy, Ohm’s paper on the failure of anonymization has certainly created a bang there.
That leaves the most important contingent: practitioners. Technology companies have been quick to learn the lessons — differential privacy was invented by Microsoft researchers — and have been studying questions like sharing search logs with differential privacy assurances and building programming systems incorporating differential privacy (see PINQ developed at Microsoft Research and Airavat funded by Google.)
Other sectors, especially medical informatics, have been far slower to adapt, and it is not clear if they ever will. Multiple speakers at this workshop dealing with applications in different sectors talked about their efforts at anonymizing high-dimensional data (good luck with that). The problems are compounded by the fact that differential privacy isn’t yet at a point where it is easily usable in applications and in many cases the upshot of the theory has been to prove that the simultaneous utility and privacy requirements simply cannot be met. It will probably be the better part of a decade before differential privacy starts to make any real headway into real-world usage.
Summary. I hope I’ve shown you what scientific “paradigms” are, how they are adopted and discarded. Paradigm shifts are important turning points for scientific disciplines and often have big consequences for society as a whole. Finally, science is not a cold sequence of deductions but is done by real people with real motivations; the scientific process has a significant social and cultural component, even if the output of science is objective.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
6 comments February 25, 2010
Google Docs Identity Leak Bug Fixed
Yesterday I wrote about a bug in Google Docs that lets an arbitrary website find your identity. This morning I woke up to this piece of good news in my Inbox:
The fix is pushed out and live for all users as of the middle of last night. Basically we only show the username of collaborators if they are explicitly listed on the ACL of the spreadsheet. Otherwise we call them “Anonymous user”. This means that an editor of the document had to already know the username in order for that username to be visible to collaborators.
I can confirm that the demo page no longer finds my identity. And the spreadsheet in my last post now looks like this:
The Google Docs help question “Collaborating: Why are some users anonymous?” explains:
If a document is set by the owner to be viewable or editable by everyone, then Google Docs does not show the names of those who choose to view or edit the document. Google Docs displays only the identities of users who are explicitly given permission to view or edit a document (either individually or as part of a group).
You might wonder what happens if the attacker explicitly gives permission to a whole bunch of users (say using scraped email addresses) . There seems to be an extra level of protection now:
Sounds like a happy resolution.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
7 comments February 23, 2010
How Google Docs Leaks Your Identity
Recap. In the previous two articles in this Ubercookies series, I showed how an arbitrary website that you visit can learn your identity using the “history stealing” bug in web browsers. In this article I will show how a bug in Google Docs gives any website the same capability in a far easier manner.
Update. A Google Docs team member tells me that a fix should be live later today.
Update 2. Now fixed.
About six weeks ago I discovered that a feature/bug in Google docs can be used to mass harvest e-mail addresses. I noted it in my journal, but soon afterwards I realized that it was much worse: you could actually discover the identity of web visitors using the bug. Recently, Vincent Toubiana and I implemented the attack; here is a video of the demo webpage (on my domain, in no way related to Google) just to show that we got it working.
(You might need to hit pause to read the text.)
I’m not releasing the live demo, since the vulnerability unfortunately still exists (more on this below). Let us now study the attack in more detail.

Bug or feature? Google Spreadsheets has a feature that tells you who else is editing the document. It’s actually really nifty: you can see in real time who is editing which cell, and it even seems to have live chat. The problem is that this feature is available even for publicly viewable documents. Do you see where this is going?
First of all, this is a problem even without the surreptitious use I’m going to describe. Here’s a public spreadsheet I found with 10 seconds of Googling that a few people seem to be viewing when I looked. I’m not sure the author of this document intended it to be publicly viewable or editable.
The attack works by embedding an invisible iframe (dimensions 0×0) into the malicious web page. The iframe loads a public spreadsheet that the attacker has already created. In a separate backend process, the attacker constantly checks the list of people viewing the spreadsheet and records this information. After the iframe is embedded, the Javascript on the page page waits a second or two and queries the attacker’s server to get the username of the user who most recently appeared on the list.
What if multiple people are visiting the page at roughly the same time? It’s not a problem, for two reasons: 1. Google Spreadsheets has a “push” notification system for updating the frontend which enables the attacker to get the identity of the new user virtually instantaneously. 2. To further increase accuracy, the attacker can create (say) 10 spreadsheets and embed a random subset of 5 into any given visitor’s page, making it exceptionally unlikely that there will be a collision.
The only inefficient part of the attack as Toubiana and I have implemented it is that it requires a browser (with a GUI) to be open to monitor the spreadsheet. Browser rendering engines have been modularized into scriptable components, so with a little more effort it should be possible to run this without a display. At present I have it running out of an old laptop tucked away in my dresser
The “backend” is in there
Defense. How can Google fix this bug? There are stop-gap measures, but as far as I can see the only real solution is to disable the collaborator list for public documents. Again a trade-off between functionality and privacy as we saw in the previous article.
Many people responded to my original post saying they were going to stay logged out of Google when they didn’t need to be logged in (since you can’t log out of just Google Docs separately). Unfortunately, that’s not a feasible solution for me, and I suspect many other people. There are at least 3 Google services that I constantly need to keep tabs on; otherwise my entire workflow would come to a screeching halt. So I just have to wait for Google to do something about this bug. Which brings me to my next point:
Great power, great responsibility. There is a huge commercial benefit to becoming an identity provider. As Michael Arrington has repeatedly noted, many Internet companies issue OpenIDs but don’t accept them from other providers, in a race to “own the identity” of as many users as possible. That is of course business as usual, but the players in this race need to wake up to the fact that being an identity provider is asking users for a great deal of trust, whether or not users realize it.
An identity-stealing bug is an (unintentional) violation of that trust because — among many other reasons — it is a precursor to stealing your actual account credentials. (That is particularly scary with Google due to their lack of anything resembling customer service for account issues.) One strategy for stealing account credentials is a phishing page mimicking the Google login page, with your username filled in. Users are much less likely to be suspicious and more likely to respond to messages that have their name on them. Research on social phishing reaches similar conclusions.
I’ve been in contact with people at Google about this bug and I’ve been told a fix is being worked on, specifically that “less presence information will be revealed.” I take it to mean the attack described here won’t work. Since they are making a good-faith effort to fix it, I’m not releasing the demo itself. It has been a long time, though. The Buzz privacy issues were fixed in 4 days, and that kind of urgency is necessary for security issues of this magnitude.
A kind of request forgery. The attack here can be seen as a simpleminded cross-site request forgery. In general, any type of request forgery bug that causes your browser to initiate a publicly recorded interaction on your behalf will immediately leak you identity. For example, if (hypothetically) visiting a URL causes your browser to leave a comment on a specific Youtube video, then the attacker can create a Youtube video and constantly monitor it for comments, mirroring the attack technique used here.
Another technical lesson from this bug is that access control in social networking can be tricky. I’ve written before that privacy in social networking is about a lot more than access control, and that theory doesn’t help determine user reactions to your product. But this bug was an access control issue, and theory would have helped. Websites designing social features would do well to have someone with an academic background thinking about security issues.
Up next. In this post as well as the previous ones, I’ve briefly hinted at what exactly can go wrong if websites can learn your identity. The next post in this series will examine that issue in more detail. Stay tuned — it turns out there’s quite a bit more to say about that, and you might be surprised.
Thanks to Vincent Toubiana for reviewing a draft.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
8 comments February 22, 2010
Ubercookies Part 2: History Stealing meets the Social Web
Recap. In the previous article I introduced ubercookies — techniques that websites can use to de-anonymize visitors. I discussed a recent paper that shows how to use history stealing along with social network group membership information to find the visitor’s identity, and I promised a stronger variant of the attack.
The observation that led me to the attack I’m going to describe is simple: social networking isn’t just about social networks — the whole web has gone social. It’s a view that you quickly internalize if you spend any time hanging out with Silicon Valley web entrepreneurs
Let’s break the underlying principle of the identity-stealing attack down to its essence:
A user leaves a footprint whenever their interaction with a specific web page is recorded publicly.
De-anonymization happens when the attacker can tie these footprints together into “trails” that can then be correlated with the user’s browser history. Efficiently querying the history to identify multiple points on the trail is a challenging problem to solve, but in principle de-anonymization is possible as long as the user’s actions on different web pages happen under the same identity.
Footprints can be tied together into trails as long as all the interactions happen under the same identity. There is no need for the interactions to be on the same website.
There are two major ways in which you can interact with arbitrary websites under a unified identity, both of which are defining principles of the social web. The first is federated identity, which means you can use the same identity provider wherever you go. This is achieved through OpenID and similar mechanisms like Facebook Connect. The second is social sharing: whenever you find something interesting anywhere on the web, you feed it back to your social network.
Now let’s examine the different types of interactions in more detail.
A taxonomy of interaction on the social web.
0. The pre-social web had no social networks and no delegated identity mechanism (except for the failed attempt by Microsoft called Passport). Users created new identities on each website, authenticated via site-specific usernames and passwords to each site separately. The footprints on different sites cannot be tied together; for practical purposes there are no footprints.
1. Social networks: affiliation. In social networks, users interact with social objects and leave footprints when the actions are public. The key type of interaction that is useful for de-anonymization is the expression of affiliation: this refers to not just the group memberships studied in the recent Wondracek et al. paper, but also includes
- memberships of fan pages on Facebook
- “interests” on Livejournal
- follow relationships and plain old friend relationships on Twitter and other public social networks
- subscriptions to Youtube channels
and so on.
All of these interactions, albeit very different from the user perspective, are fundamentally the same concept:
- you “add yourself” to or affiliate yourself with some object on a social network
- this action can be publicly observed
- you almost certainly visited a URL that identifies the object before adding it.
2. The social web: sharing. When you find a page you like — any page at all — you can import it or “share” it to your social stream, on Facebook, Twitter, Google Buzz, or a social bookmarking site like Delicious. The URL of the page is almost certainly in your history, and as long as your social stream is public, your interaction was recorded publicly.
3. The social web: federated identity. When you’re reading a blog post or article on the social web, you can typically comment on it, “like” it, favorite it, rate it, etc. You do all this under your Facebook, Google or other unified identity. These actions are often public and when they are, your footprint is left on the page.
A taxonomy of attacks
The three types of social interactions above give rise to a neat taxonomy of attacks. They involve progressively easier backend processing and progressively more sophisticated history search techniques on the front end. But the execution time on the front-end doesn’t increase, so it is a net win. Here’s a table:
| Type of interaction |
Backend processing |
Type of history URL |
Location of footprint |
| Affiliation | Crawling of social network | Object in a social network | In the social network |
| Sharing | Syndication of social stream(s) from social network | Any page | In the social network |
| Federated identity | None; optional crawling | Any page | On the page |
.
1. Better use of affiliation information. The Wondracek et al. paper makes use of only group membership. One natural reason to choose groups is that there are many groups that are large, with thousands of members, so it gives us a reasonably high chance that by throwing darts in the browser history we will actually hit a few groups that the user has visited. On the other hand, if we try to use the Facebook friend list, for example, hoping to find one of the user’s friends by random chance, it probably won’t work because most users have only a few hundred friends.
But wait: many Twitter users have thousands or even millions of followers. These are known as “hubs” in network theory. Clearly, the attack will work for any kind of hubs that have predictable URLs, and users on Twitter have even more predictable URLs (twitter.com/username) than groups on various networks. The attack will also work using Youtube favorites (which show up by default on the user’s public profile or channel page) and whatever other types of affiliation we might choose to exploit, as long as there are “hubs” — nodes in the graph with high degree. Already we can see that many more websites are vulnerable than the authors envisaged.
2. Syndicating the social stream: my Delicious experiment.
The interesting thing about the social stream is that you can syndicate the stream of interactions, rather than crawling. The reasons why syndication is much easier than crawling are more practical than theoretical. First, syndicated data is intended to be machine readable, and is therefore smaller as well as easier to parse compared to scraping web pages. Second, and more importantly, you might be to get a feed of the entire site-wide activity instead of syndicating each user’s activity stream separately. Delicious allows global syndication; Twitter plans to open this “firehose” feature to all developers soon.
Another advantage of the social stream is that everything is timestamped, so you can limit yourself to recent interactions, which are more likely to be in the user’s history.
Using the delicious.com dataset made available by DAI-labor (a log of all bookmarking activity on delicious.com over several years), I did a simulated experiment using 3 months worth of data: assuming that users keep their history around for 3 months, do in fact visit every link they post on delicious, how many users would a hypothetical history stealing attack be able to identify? I had a pretty good success rate: about 60% of the users who had shared at least 2 links in the 3-month period, or about 300,000 users. This takes at most 4000-5000 Javascript history queries.
Needless to say, once Twitter opens up its firehose, Twitter users (who are far more numerous than delicious users) would also be susceptible to the same technique.
This attack is not possible to fix via server-side URL randomization. It can also be made to work using Facebook, Google Buzz, and other sharing platforms, although the backend processing required won’t be as trivial (but probably no harder than in the original attack.)
3. A somewhat random walk through the history park.
And now for an approach that potentially requires no backend data collection, although it is speculative and I can’t guess what the success rate would be. The attack proceeds in several steps:
- Identify the user’s interests by testing if they’ve visited various popular topic-specific sites. Pick one of the user’s favorite topics. Incidentally, a commenter on my previous post notes he is building exactly this capability using topic pages on Wikipedia, also with the goal of de-anonymization!
- Grab a list of the top blogs on the topic you picked from one of the blog directories. Query the history to see which of these blogs the user reads frequently. It is even possible to estimate the level of interest in a blogs by looking at the fraction of the top/recent posts from that blog that the user has visited. Pick a blog that the user seems to visit regularly.
- Look for evidence of the user leaving comments on posts. For example, on Blogger, the comment page for a post has the URL http://www.blogger.com/comment.g?blogID=<blogid>&postID=<postid>.
- Once you find a couple of posts where it looks like the user made a comment, scrape the list of people who commented on it, find the intersection. (Even a single comment might suffice; as long as you have a list of candidates, you easily verify if it’s one of them by testing user-specific URLs. More below.)
- Depending on the blogging platform, you might even be able to deduce that the user responded (or intended to respond) to a specific comment. For example, On wordpress you have the pattern http://<blogname>.wordpress.com/<postname>/?replytocom=<commentid>#respond. If you get lucky and find one of those patterns, that makes things even easier.
If at first you don’t succeed, pick a different blog and repeat.
I suspect that the most practical method would be to use a syndicated activity stream from a social network, but also to use the heuristics presented above to more efficiently search through the history.
Epilogue: Identity.
Not only has there been a movement towards a small number of identity providers on the web, there are many aggregators out there that have sprung up in order to automatically find the connections between identities across the different identity providers, and also connect online identities to physical-world databases. As Pete Warden notes:
One of the least-understood developments of the last few years is the growth of databases of personal information linked to email addresses. Rapleaf is probably the leader in this field, but even Flickr lets companies search their API for users based on an email address.
I ran my email address through his demo script and it is quite clear that virtually all of my online identities have been linked together. This is getting to be the norm; as a consequence, once an attacker gets any kind of handle on you, they can go “identity hopping” and find out a whole lot more about you.
This is also the reason that once the attacker can make a reasonable guess at the visitor’s identity, it’s easy to verify the guess. Not only can they look for user-specific URLs in your history to confirm the guess (described in detail in the Wondracek et al. paper), but all your social streams on other sites can also be combined with your history to corroborate your identity.
Up next in the Ubercookies series: So that’s pretty bad. But it’s going to get worse before it can get better
In the next article, I will describe an entirely different attack strategy to get at your identity by exploiting a bug in a specific identity provider’s platform.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
3 comments February 19, 2010
Cookies, Supercookies and Ubercookies: Stealing the Identity of Web Visitors
Synopsis. Highly sticky techniques called supercookies for tracking web visitors are becoming well known. But the state of the art has in fact advanced beyond tracking, allowing a website to learn the identity of a visitor. I will call these techniques ubercookies; this article describes one such recently discovered technique. Future articles in this series will describe even more powerful variants and discuss the implications.
Cookies. Most people are aware that their web browsing activity over time and across sites can be tracked using cookies. When you are being tracked, it can be deduced that the same person visited certain sites at certain times, but the sites doing the tracking don’t know who you are, i.e., you name, etc., unless you choose to tell them in some way, such as by logging in.
Cookies are easy to delete, and so there’s been a big impetus in the Internet advertising industry to discover and deploy more robust tracking mechanisms.
Supercookies. You may surprised to find just how helpless a user is against a site (or more usually, a network of sites) that is truly determined to track them. There are Flash cookies, much harder to delete, some of which respawn the regular HTTP cookies that you delete. The EFF’s Panopticlick project demonstrates many “browser fingerprinting” methods which are more sophisticated. (Jonathan Mayer’s senior thesis contained a smaller-scale demonstration of some of those techniques).
A major underlying reason for a lot of these problems is that any browser feature that allows a website to store “state” on the client can be abused for tracking, and there are a bewildering variety of these. There is a great analysis in a paper by my Stanford colleagues. One of the points they make is that co-operative tracking by websites is essentially impossible to defend against.
Ubercookies: history stealing. Now let’s get to the scary stuff: uncovering identity. History stealing or history sniffing is an unintended consequence of the way the web is designed; it allows a website to learn which URLs you’ve been to. While a site can’t simply ask your browser for a list of visited URLs, it can ask “yes/no” questions and your browser will faithfully respond. The most common way of doing this is by injecting invisible links into the page using Javascript and exploiting the fact that the CSS link color attribute depends on whether the link has been visited or not.
History stealing has been known for a decade, and browser vendors have failed to fix it because it cannot be fixed without sacrificing some useful functionality (the crude way is to turn off visited link coloring altogether; a subtler solution is SafeHistory). Increasingly worse consequences have been discovered over the years: for example, a malicious site can learn which bank you use and customize a phishing page accordingly. But a paper (full text, PDF) coming out at this year’s IEEE S&P conference at Oakland takes it to a new level.
Identity. Let’s pause for a second and think about what finding your identity means. In the modern, social web, social network accounts have become our de-facto online identities, and most people reveal their name and at least some other real-world information about ourselves on our profiles. So if the attacker can discover the URL of your social network profile, we can agree that he has identified you for all practical purposes. And the new paper shows how to do just that.
The attack relies on the following observations:
- Almost all social networking sites have some kind of “group” functionality: users can add themselves to groups.
- Users typically add themselves to multiple groups, at least some of which are public.
- Group affiliations, just like your movie-watching history and many other types of attributes, are sufficient to fingerprint a user. There’s a high chance there’s no one else who belongs to the same set of groups that you do (or is even close). [Aside: I used this fact to show that Lending Club data can be de-anonymized.]
- Users who belong to a group are likely to visit group-specific URLs that are predictable.
Put the above facts together, and the attack emerges: the attacker (an arbitrary website you visit, without the co-operation of whichever social network is used as an attack enabler) uses history stealing to test a bunch of group-related URLs one by one until he finds a few (public) groups that the anonymous user probably belongs to. The attacker has already crawled the social network, and therefore knows which user belongs to which groups. Now he puts two and two together: using the list of groups he got from the browser, he does a search on the backend to find the (usually unique) user who belongs to all those groups.
Needless to say, this is a somewhat simplified description. The algorithm can be easily modified so that it will work even if some of the groups have disappeared from your history (say because you clear it once in a while) or if you’ve visited groups you’re not a member of. The authors demonstrated that the attack with real users on the Xing network, and also showed theoretically that it is feasible on a number of other social networks including Facebook and Myspace. It takes a few thousand Javascript queries and runs in a few seconds on modern browsers, which makes it pretty much surreptitious.
Fallout. There are only two ways to try to fix this. The first is for all the social networking sites to change their URL patterns by randomizing them so that point 4 above (predictable URL identifying that you belong to a group) is no longer true. The second is for all the browser vendors to fix their browsers so that history stealing is no longer possible.
The authors contacted several of the social networks; Xing quickly implemented the URL randomization fix, which I find surprising and impressive. Ultimately, however, Xing’s move will probably be no more than a nice gesture, for the following reason.
Over the last few days, I have been working on a stronger version of this attack which:
- can make use of every URL in the browser history to try and identify the user. This means that server-side fixes are not possible, because literally every site on the web would need to implement randomization.
- avoids the costly crawling step, further lowering the bar to executing the attack.
That leaves browser-based fixes for history stealing, which hasn’t happened in the 10 years that the problem has been known. Will browsers vendors finally accept the functionality hit and deal with the problem? We can hope so, but it remains to be seen.
In the next article, I will describe the stronger attack and also explain in more detail why your profile page on almost any website is a very strong identifier.
Thanks to Adam Bossy for reviewing a draft.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
23 comments February 18, 2010
Privacy is not Access Control (But then what is it?)
In my previous article on the Google Buzz fiasco, I pointed out that the privacy problems were exacerbated by the fact that the user interface was created by programmers. In this post I will elaborate on that theme and provide some constructive advice on privacy-conscious design, especially for social networking.
The problem I’m addressing is that as far as computer scientists and computer programmers are concerned, privacy is a question of access control, i.e., who is allowed to look at what. Unfortunately, in the real world, that is only a tiny part of what privacy is about. Here are three examples to make my point:
1. Dummy cameras. Consider a thought experiment: suppose the government installed a bunch of cameras all over a public park along with prominent signs announcing 24×7 surveillance. The catch, however, is that the cameras have not been turned on. Has anyone’s privacy been violated?
From the computer science perspective, the answer is no, because no one is actually being observed, nothing is being recorded and no data is being generated. But common sense tells us that something is wrong with that answer. The cameras cause people considerable discomfort. The surveillance, real or imaginary, changes their behavior.
This hypothetical scenario is adapted from Ryan Calo’s paper, which analyzes in detail the “sensation of being observed.”
2. Aggregation changes the equation. Remember the uproar when Facebook released News Feed? No new information was revealed to your friends that wasn’t accessible to them before; it was just that the News Feed made it dramatically easier to observe all your activities on the site.
Of course, it goes both ways: the technology in turn changed people’s expectations; it is now hard to imagine not having a feed-like system, whether on Facebook or another social network. Nevertheless, I often see people putting something into their profile, deciding a few moments later that they didn’t want to share it after all, and realizing that it was too late because the information has already been broadcast to their friends.
3. Everyone-but-X access control, which I described in an earlier article, shows in a direct way how access control fails to capture privacy requirements. From the traditional CS security perspective, the ability for a user to make something visible to “everyone but X” is meaningless: X can always create a fake account to get around it.
But a use-case should hopefully immediately convince you that everyone-but-X is a good idea: your sibling is on your friends list and you want to post about your sex life. It’s not that you want to prevent X from having access to your post, but rather that both of you prefer that X didn’t have access to it.
Access control is not the goal of privacy design. It is at best one of many tools. Rather, human behavior is key. The dummy cameras were bad because they affected the behavior of people in a detrimental way. News feed was bad because it introduced major new privacy consequences for the behaviors that people were accustomed to on the site. (However, I would argue that the dramatic increase in usefulness trumped the privacy drawbacks.) Everyone-but-X privacy is good because it allows people to carry over to the online setting behaviors that they are used to in the real world.
It is impossible to fully analyze the privacy consequences of a design decision without studying its impact on actual user behavior. There is no theoretical framework to ensure that a design decision is safe — user testing is essential. Going back to Google Buzz, a beta period or a more gradually phased roll-out would have undoubtedly been better.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
8 comments February 13, 2010
Google Buzz, Social Norms and Privacy
Another day, another privacy backlash — this time with Google Buzz. What’s new? Lots, as it turns out.
There are many minor ways in which Google Buzz fails, both with regard to privacy and otherwise. For example, I’ve been posting my Buzz updates publicly because the user interface for posting it to a restricted group is horribly clunky. (Post only to my followers? What’s the point of that, when anyone can start following me?! Make it easy to post to a group that I have control over!)
But the major privacy SNAFU, as you’ve probably heard, is auto-follow. Google automatically makes public a list of the top 25 or so people you’ve corresponded with in Gmail or Google talk. Worse, the button to turn this “feature” off resides in your Google-wide profile, making it unnecessarily hard to find because it isn’t within the Buzz interface itself.
This is a classic example of what happens when the user interface is created by programmers instead of designers, a recurring problem for Google. Programmers partition features in a way that fits the computer’s natural data model, rather than the user’s natural mental model.
But getting back to privacy, it is a certainty in a statistical sense that Google outed a few affairs and other secret relationships. For even if you were yourself savvy enough to turn off the public display of your top correspondents, there’s a good chance the other party wasn’t, and might not have turned it off on their end.
When I enabled Buzz and realized what had happened, something changed for me in my head. I’d always regarded email and chat as a private medium. But that’s not true any more; Google forced me to discard my earlier expectations. Even if Google apologizes and retracts auto-follow (not that I think that’s likely), the way I view email has permanently changed, because I can’t be sure that it won’t happen again. I lost some of the privacy expectation that I had of not only Google’s services, but of email and chat in general, albeit to a lesser extent.
What I’ve tried to do in the preceding paragraphs is show in a step-by-step manner how Google’s move changed social norms. The larger players like Google and Microsoft have been very conservative when it comes to privacy, unlike upstarts like Facebook. So why did Google enable auto-follow? By all accounts, their hand was forced: they needed a social network to compete with Facebook and Twitter. Given the head-start that their competitors have, the only real way to compete was to drag their users into participating.
Google ended up changing society’s norms in a detrimental way in order to meet their business objectives. This has become a recurring theme (c.f. the section on Facebook in that article). I don’t think there is any possibility of putting the genie back in the bottle; this trend will only continue. This time it was about who I email; soon my expectations about the contents of emails themselves will probably change.
I believe that in the long run, the only “stable equilibrium” of privacy norms, as it were, would be for everyone to simply assume that everything they type into a computer will be publicly visible either instantly or at some point in the future, outside their control. That is not necessarily as terrible as it may seem. Nonetheless, society will take a long time to get there. Until then, the best we can do is push back against intrusions as much as possible, delaying the inevitable, giving ourselves enough time to adapt.
Do your part to fight back against auto-follow. Let Google know how you feel. Blog about it or leave a comment.
Updates
- A New York Times blogger picked up the controversy.
- Joe Bonneau has an analysis of users’ confused reactions.
- Google has announced that it is rolling out some user-interface changes in response to the feedback. That is better than before, but the default is still public auto-follow.
- The horror stories due to auto-follow have begun.
- I have a new article with advice on privacy-conscious design.
- Google decided to nix auto-follow after all! Awesome.
Thanks to Joe Bonneau for reviewing a draft of this article.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
20 comments February 11, 2010
The Secret Life of Data
Some people claim that re-identification attacks don’t matter, the reasoning being: “I’m not important enough for anyone to want to invest time on learning private facts about me.” At first sight that seems like a reasonable argument, at least in the context of the re-identification algorithms I have worked on, which require considerable human and machine effort to implement.
The argument is nonetheless fallacious, because re-identification typically doesn’t happen at the level of the individual. Rather, the investment of effort yields results over the entire database of millions of people (hence the emphasis on “large-scale” or “en masse”.) On the other hand, the harm that occurs from re-identification affects individuals. This asymmetry exists because the party interested in re-identifying you and the party carrying out the re-identification are not the same.
In today’s world, the entities most interested in acquiring and de-anonymizing large databases might be data aggregation companies like ChoicePoint that sell intelligence on individuals, whereas the party interested in using the re-identified information about you would be their clients/customers: law enforcement, an employer, an insurance company, or even a former friend out to slander you.
Data passes through multiple companies or entities before reaching its destination, making it hard to prove or even detect that it originated from a de-anonymized database. There are lots of companies known to sell “anonymized” customer data: for example Practice Fusion “subsidizes its free EMRs by selling de-identified data to insurance groups, clinical researchers and pharmaceutical companies.” On the other hand, companies carrying out data aggregation/de-anonymization are a lot more secretive about it.
Another piece of the puzzle is what happens when a company goes bankrupt. Decode genetics recently did, which is particularly interesting because they are sitting on a ton of genetic data. There are privacy assurances in place in their original Terms of Service with their customers, but will that bind the new owner of the assets? These are legal gray areas, and are frequently exploited by companies looking to acquire data.
At the recent FTC privacy roundtable, Scott Taylor of Hewlett Packard said his company regularly had the problem of not being able to determine where data is being shared downstream after the first point of contact. I’m sure the same is true of other companies as well. (How then could we possibly expect third-party oversight of this process?) Since data fuels the modern Web economy, I suspect that the process of moving data around will continue to become more common as well as more complex, with more steps in the chain. We could use a good name for it — “data laundering,” perhaps?
1 comment February 6, 2010
In which I come out: Notes from the FTC Privacy Roundtable
I was on a panel at the second FTC privacy roundtable in Berkeley on Thursday. Meeting a new community of people is always a fascinating experience. As a computer scientist, I’m used to showing up to conferences in jeans and a T-shirt; instead I found myself dressing formally and saying things like “oh, not at all, the honor is all mine!”
This post will also be the start of a new direction for this blog. So far, I’ve mostly confined myself to “doing the math” and limiting myself to factual exposition. That’s going to change, for two reasons:
- The central theme of this blog and of my Ph.D dissertation — the failure of data anonymization — now seems to be widely accepted in policy circles. This is due in large part to Paul Ohm’s excellent paper, which is a must-read for anyone interested in this topic. I no longer have to worry about the acceptance of the technical idea being “tainted” by my opinions.
- I’ve been learning about the various facets of privacy — legal, economic, etc. — for long enough to feel confident in my views. I have something to contribute to the larger discussion of where technological society is heading with respect to privacy.
Underrepresentation of scientists
As it turned out, I was the only academic computer scientist among the 35 panelists. I found this very surprising. The underrepresentation is not because computer scientists have nothing to contribute — after all, there were other CS Ph.Ds from industry groups like Mozilla. Rather, I believe it is a consequence of the general attitude of academic scientists towards policy issues. Most researchers consider it not worth their time, and a few actively disdain it.
The problem is even deeper: academics have the same disdainful attitude towards the popular exposition of science. The underlying reason is that the goal in academia is to impress one’s peers; making the world better is merely a side-effect, albeit a common one. The incentive structure in academia needs to change. I will pick up this topic in future posts.
The FTC has an admirable approach to regulation
As I found out in the course of the day’s panels, the FTC is not about prescribing or mandating what to do. Pushing a specific privacy-enhancing technology isn’t the kind of thing they are interested in doing at all. Rather, they see their role as getting the market to function better and the industry to self-regulate. The need to avoid harming innovation was repeatedly emphasized, and there was a lot of talk about not throwing the baby out with the bathwater.
The following were the potential (non baby hurting) initiatives that were most talked about:
- Market transparency. Markets can only work well when there is full information, and when it comes to privacy the market has failed horribly. Users have no idea what happens to their data once it’s collected, and no one reads privacy policies. Regulation that promotes transparency can help the market fix itself.
- Consumer education. This is a counterpart to the previous point. Education about privacy dangers as well as privacy technologies can help.
- Enforcement. A few bad apples have been responsible for the most egregious privacy SNAFUs. The larger players are by and large self-regulating. The FTC needs to work with law enforcement to punish the offenders.
- Carrots and sticks. Even the specter of regulation, corporate representatives said, is enough to get the industry to self-regulate. Many would disagree, but I think a carrots-and-sticks approach can be made to work.
- Incentivizing adoption of PETs (privacy enhancing technologies) in general. The question of how the FTC can spur the adoption of PETs was brought up on almost every panel, but I don’t think there were any halfway convincing answers. Someone mentioned that the government in general could go into the market for PETs, which seems reasonable.
As a libertarian, I think the overall non-interventionist approach here is exactly right. I’m told that the FTC is rather unusual among US regulatory agencies in this regard (which makes sense, considering that the FCC, for example, spends its time protecting children from breasts when it is not making up lists of words.)
Facebook’s two faces

Facebook public policy director Tim Sparapani, who was previously with the ACLU, made a variety of comments on the second panel that were bizarre, to put it mildly. Take a look (my comments are in sub-bullets):
- “We absolutely compete on privacy.”
- That’s a weird definition of “compete.” Facebook has a history of rolling out privacy-infringing updates, such as Beacon, the ToS changes, and the recent update that made the graph public. Then they wait to see if there’s an outcry and roll back some of the changes. It is hard to think of another company has had such a cavalier approach.
- “There are absolutely no barriers to entry to create a new social network.”
- Except for that little thing called the network effect, which is the mother of all barriers to entry. In a later post I will analyze why Facebook has reached a critical level of penetration in most markets which makes it nearly unassailable as a general-purpose social network.
- “Our users have learned to trust us.”
- I don’t even know what to say about this one.
- “We are a walled garden.”
- Sparapani is confusing two different senses of “walled garden” here. This was said in response to a statement by the Google rep about Google’s features to let users migrate their data to other services (which I find very commendable). In this sense, Facebook is indeed a walled garden, and doesn’t allow migration, which is a bad thing. But Sparapani said he meant it in the sense that Facebook doesn’t sell user data wholesale to other companies. That sounds like good news, except that third party app developers end up sharing user data with other entities, because enforcement of the application developer Terms of Service is virtually non-existent.
- “If you delete the data it’s gone.” (in the context of deleting your account)
- That might be true in a strict sense, but it is misleading. Deleting all your data is actually impossible to achieve because most pieces of data belong to more than one user. Each of your messages will live on in the other person’s inbox (and it would be improper to delete it from theirs). Similarly, photos in which you appear, which you would probably like gone when you delete your account, still live on in the album of whoever took the picture. The same goes for your pokes, likes and other multi-user interactions. These are the very things that make a social network social.
- “We now have controls on privacy at the moment you share data. This is an extraordinary innovation and our engineers are really proud of it.”
- The first part of that statement is true: you can now change the privacy controls on each of your Facebook status messages independently. The second part is downright absurd. It is completely trivial to implement from an engineering perspective (and LiveJournal for instance has had it for a decade).
There were more absurd statements, but you get the picture. It’s not just the fact that Sparapani’s comments were unhinged from reality that bothers me — the general tone was belligerent and disturbing. I missed a few minutes of the panel, during which he apparently he responded to a criticism from Chris Conley of the ACLU by saying “I was at the ACLU longer than you’ve been there.” This is unprofessional, undignified and a non-answer. Amusingly, he claimed that Facebook was “very proud” of various aspects of their privacy track record at least half a dozen times in the course of the panel.
Contrast all this with Mark Zuckerberg’s comments in an interview with Michael Arrington, which can be summed up as “the age of privacy is over.” That article goes on to say that Facebook’s actions caused the shift in social norms (to the extent that they have shifted at all) rather than merely responding to them. Either way, it is unquestionable that Facebook’s true behavior at the present time pays lip service to privacy, and Zuckerberg’s statement is a more-or-less honest reflection of that. On the other hand, as I have shown, the company sings a completely different tune when the FTC is listening.
Engaging privacy skeptics
Aside from Facebook’s shenanigans, I feel that that there are two groups in the privacy debate who are talking past each other. One side is represented by consumer advocates, and is largely echoed by the official position of the FTC. The other side’s position can be summed up as “yeah, whatever.” When expressed coherently, there are three tenets of this position (with the caveats that not all privacy skeptics adhere to all three):
- Users don’t care about privacy any more
- Even if they do, privacy is impossible to achieve in the digital age, so get over it
- There are no real harms arising from privacy breaches.
To the right is an illustrative example of a mainstream-media representative who was at the workshop covering it on Twitter through the lens of his preconceived prejudices.
Privacy scholars never engage with the skeptics because the skeptical viewpoint appears obviously false to anyone who has done some serious thinking about privacy. However, it is crucial to engage the opponents, because 1. the skeptical view is extremely common 2. many of the startups coming out of the valley fall into this group, and they are are going to have control over increasing amounts of user data in the years to come.
The “privacy is dead” view was most famously voiced by Scott McNealy. In its extreme form it is easy to argue against: “start streaming yourself live on the Internet 24/7, and then we’ll talk.” (To be sure, a few people did this 10 years ago as a publicity stunt, but it is obvious that the vast majority of people aren’t ready for this level of invasiveness of monitoring/data collection.) But engaging with skeptics isn’t about refutation, it’s about dealing with a different way of thinking and getting the message across to the other side. Unfortunately real engagement hasn’t really been happening.
I have a double life in academia and the startup world, and I think this puts me in a somewhat unusual position of being able to appreciate both sides of the argument. My own viewpoint is somewhere in the middle; I will expand on this theme in future blog posts.
13 comments January 31, 2010
The Entropy of a DNA profile
I’m often asked how much entropy there is in the DNA profiles used in forensic investigations. Specifically, is it more than 33 bits, i.e., can it uniquely identify individuals? The short answer is: yes in theory, but there are many caveats in practice, and false matches are fairly common.
To explain the details, let’s start by looking at what is actually stored in a DNA profile. Your entire genome consists of billions of base pairs, but for profiling purposes, only a tiny portion of it is looked at — specifically, 13 locations or loci (in the U.S. version, which I will focus on. The U.K. version uses 10 loci.) Each of these loci yields a pair of integers which varies from person to person. You can see an example DNA profile on this page.
The degree of variation in the pairs of numbers — genotypes — at each locus has been empirically measured by many studies. Since biological laws dictate that the genotypes at different loci are uncorrelated, we can calculate entropy by simply adding up the entropy at individual loci. I analyzed (source code) the raw data on variation at each locus from a sample of U.S. Caucasians, and arrived at a figure of between 3.0 and 5.6 bits of entropy per locus and 54 bits of entropy for the whole 13-locus DNA profile. In addition, there is 1 sex-determining bit.
Since that number is well over 33 bits, with a high probability there is no one else who shares your DNA profile. However, there are many complications to this rosy picture:
Non-uniform genotype probabilities. The entropy calculation doesn’t quite tell the whole story, because some genotypes at each locus are much more common than others. If you happen to end up with a common genotype in all (or most) of the 13 loci, then there might be a significant chance that someone else in the world shares your DNA profile.
Population structure. The calculation above assumes the Hardy-Weinberg equilibrium, which is only true if mating is random, among other things. In reality, due to the non-random population structure, there is a slight deviation from the theoretical value. This manifests in two ways: first, the allele frequencies for different population groups (ethnic groups) need to be calculated separately. Second, there is a deviation from the expected genotype frequencies even within population groups, which is more difficult to account for (a correction factor called “theta” is applied in forensic calculations).
Familial relationships. Since we share half of our DNA with each parent and sibling, there is a much higher chance of a profile match between close relatives than between unrelated individuals. Therefore DNA database matches often turn up a relative of the perpetrator even if the perpetrator is not in the database (especially with partial matches; see below).
In recent years, law enforcement has sometimes adopted the strategy of turning this problem on its head and using these familial leads as starting points of investigation as a way to get to the true perpetrator. This is a controversial practice.
Each of the above factors results in an increase in the probability of a match between different individuals. But the effect is small; even after taking them into account, as long as we’re talking about the full 13-locus profile, most individuals do in fact have a unique DNA profile, albeit fewer than would be predicted by the simple entropy calculation.
Unfortunately, crime-scene sample collection is far from perfect, and profiles are often not extracted accurately from the physical samples due to limitations of technology and the quality of the sample. These inaccuracies in a crime-scene profile introduce errors into the matching process, which are the primary reason for false matches in investigations.
Partial and mixed profiles. Sometimes only a “partial profile” can be extracted from a crime-scene DNA sample. This means that only a subset of the 13 genotypes can be measured. This could be because the quantity of DNA available is too small (interfering with the “amplification” process at some of the loci), because the DNA has degraded, or because it is contaminated with chemicals called PCR inhibitors that interfere with the decoding process.
The other type of inaccuracy occurs when the DNA sample collected is in fact a mixture from multiple individuals. If this happens, multiple values for some genotypes might be measured. There is no foolproof way of separating the genotypes of each individual in the mixture.
These are very common occurrences, particularly partial profiles. There are no standards on the quality or quantity of the profile data for the evidence to be admissible in court. Instead, an expert witness computes a “likelihood ratio” based on the specific partial or mixed profile, and presents this to the court. Juries are often left not knowing how to interpret the number they are presented and are vulnerable to the prosecutor’s fallacy.
The birthday paradox. The history of DNA testing is littered with false matches and leads; one reason is the birthday paradox. The number of pairs of individuals in a database of size N grows proportional to N². The FBI database, for instance, has about 200,000 crime-scene profiles and 5 million offender profiles, for a total of a 1 trillion pairs of profiles. Due to use of partial profiles to find matches, the probability of a match between two random profiles is much higher than one in a trillion.
This long but fascinating paper has many hilarious stories of false DNA matches. Laboratory errors such as mixing up labels on the samples and contamination of the sample with the technician’s DNA appear to be depressingly common as well. Here is another story of lab contamination that cost $14 million.
Why only 13 loci? One question that all this raises is that if the use of a small number of loci causes problems when only a partial profile is available, why not use more of the genome, or even all of it? Research on mini-STRs shows how to better utilize degraded DNA to recover genotypes from beyond the 13 CODIS loci. The cost of whole-genome genotyping has been falling dramatically, and enables even individuals contributing trace amounts of DNA to a mixture to be identified!
One stumbling block seems to be the small quantity of DNA available from crime scenes; whole genome amplification is being developed to address that. But I suspect that the main reason is inertia: forensic protocols, procedures and expertise in DNA profiling have evolved over the last two decades, and it would be costly to make any changes at all. Whatever the reasons, I’m certain that things are going to be very different in a decade or two, because there are millions of bits of entropy in the entire genome, and forensic science currently uses about 54 of them.
Further reading.
- Wikipedia: DNA profiling
- Legal decisions concerning DNA statistics
- DNA Testing: An Introduction for Non-Scientists
- The Potential for Error in Forensic DNA Testing (referenced above, but worth repeating!)
1 comment December 2, 2009







