Posts Tagged privacy
What Every Developer Needs to Know About “Public” Data and Privacy
It is natural for developers building web applications to operate under a public/private dichotomy, the assumption being that if a user made a piece of data public, then they’ve given up any privacy expectation. But as we saw in a previous article, users often expect more subtle distinctions, and many unfortunate privacy blunders have resulted. To avoid repeats of these, engineers need to be able to reason about the privacy implications of specific technical features. This article presents a set of criteria for doing so.
1. Archiving
Computers are designed to keep data around forever unless explicitly deleted. But this assumption makes many nontechnical people deeply uncomfortable. There have been a number of proposals to “make the Internet forget,” bringing it in line with humans’ anthropomorphic expectations. While nothing much will probably result from these broad proposals, there need to be some controls on archiving, especially by third parties. Here are three examples that illustrate why this is important:
- A woman was fired from her job recently because her employer found some of her online revelations objectionable. She got caught because Topsy, a Twitter search engine, retained her personal data in its cache even after she had deleted it from Twitter.
- Joe Bonneau revealed that the vulnerability of photo-sharing sites failing to delete photos from their CDN caches persists on many sites, a full year after it was first made public and received media attention.
- Facebook acted in a heavy-handed manner in its recent spat with Pete Warden. The company’s rationale for prohibiting crawlers seems to be that they want to impose fine-grained restrictions on third party data use. Nontrivial policies can be specified via the Terms of Use, but not via robots.txt.
The examples above show a clear need for a standard for machine-readable third-party data retention policy — a robots.txt on steroids, if you will. Pete Warden proposed expanding robots.txt a few months ago; now that multiple sites are facing this problem, perhaps there will be some momentum in this direction.
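To make the gap concrete, here is a minimal Python sketch. The first half uses the standard library's robots.txt parser to show what the existing format can say: which paths a given crawler may fetch, and nothing more. The second half is purely hypothetical: `RetentionPolicy` and `must_purge` are invented names illustrating the kind of rule a retention standard might carry, and they are not part of any existing standard or proposal. The robots.txt content and URLs are made up as well.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for a hypothetical site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profiles/

User-agent: archiver-bot
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_crawl(agent: str, url: str) -> bool:
    """All that robots.txt can answer: may this agent fetch this URL?"""
    return rules.can_fetch(agent, url)


# Hypothetical extension (NOT part of robots.txt): a retention rule that
# a third party would be expected to honor for data it already fetched.
@dataclass(frozen=True)
class RetentionPolicy:
    max_cache_seconds: int        # how long a cached copy may be kept
    purge_on_source_delete: bool  # drop the copy once the source deletes it


def must_purge(policy: RetentionPolicy, cached_at: float, now: float,
               deleted_at_source: bool) -> bool:
    """Return True if a cached copy should be discarded under the policy."""
    if deleted_at_source and policy.purge_on_source_delete:
        return True
    return now - cached_at > policy.max_cache_seconds


print(may_crawl("search-bot", "https://example.com/posts/1"))
print(may_crawl("archiver-bot", "https://example.com/posts/1"))
```

The crawl rules answer a yes/no question at fetch time; the retention rule has to be evaluated repeatedly for as long as the copy exists, which is why it does not fit the current format.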
2. Real-time
The real-time web relies on “pushing” updates to clients instead of the traditional model of crawling. The push model greatly improves timeliness and reduces machine load, but the problem is that there is typically no way to delete or update existing items in real time.
This fact bites me on a regular basis. When I make a blog post, Google Reader gets hold of it immediately, but if I realize I wrote something stupid and update the post, the change doesn’t show up for several hours because updates don’t propagate through the real-time mechanism.
Or consider tweets: if you tweet something inappropriate and delete it a second later, it might be too late: Twitter’s partners could have already gotten hold of it through the “firehose,” and it might already be displayed on a sidebar on some other site.
Google’s “undo send” feature is a great solution to this type of problem — it holds the message in a queue for a few seconds before sending it out. Every real-time system needs such a panic feature!
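The hold-then-send idea can be sketched in a few lines of Python. This is an illustration of the general pattern, not a description of how Google implements it; the class name `DelayedOutbox` and its methods are invented for the example. Time is passed in explicitly as `now` so the behavior is easy to follow; a real system would read a clock and run `release_due` on a timer.

```python
class DelayedOutbox:
    """Holds each item for `delay_seconds` before it may be pushed out."""

    def __init__(self, delay_seconds: float):
        self.delay_seconds = delay_seconds
        self._pending = {}   # item_id -> (release_at, item)
        self._next_id = 0

    def submit(self, item, now: float) -> int:
        """Queue an item; it becomes publishable after the delay."""
        item_id = self._next_id
        self._next_id += 1
        self._pending[item_id] = (now + self.delay_seconds, item)
        return item_id

    def cancel(self, item_id: int) -> bool:
        """The panic button: succeeds only while the item is still held."""
        return self._pending.pop(item_id, None) is not None

    def release_due(self, now: float) -> list:
        """Remove and return every item whose holding period has elapsed."""
        due = sorted(
            (release_at, item_id)
            for item_id, (release_at, _) in self._pending.items()
            if release_at <= now
        )
        return [self._pending.pop(item_id)[1] for _, item_id in due]


outbox = DelayedOutbox(delay_seconds=5.0)
regretted = outbox.submit("something inappropriate", now=0.0)
kept = outbox.submit("a harmless update", now=1.0)
outbox.cancel(regretted)             # retracted before anyone saw it
print(outbox.release_due(now=6.0))   # only the harmless update goes out
```

The trade-off is a few seconds of added latency in exchange for a window in which deletion actually works, since nothing has left the system yet.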
3. Search
While making data searchable greatly increases its utility, it also dramatically increases the privacy risks. It is tempting to tell users to get used to the fact that everything they write is searchable, but that hasn’t been successful so far, as IRSeek found out when they tried to launch an IRC search engine. There are entire companies like ReputationDefender that help you clean up the web search results for your name.
The lack of searchability of your site can be a feature. This is obviously not true for the majority of sites, but it is worth keeping in mind. One major reason why LiveJournal has a “closed” feel (which is a big part of its appeal) is that posts don’t rank well in Google searches, if they are indexed at all. For example, LiveJournal posts have a numeric ID instead of title words in the URL. Although it sounds like someone skipped SEO 101, it is actually by design.
4. Aggregation
By aggregate data I mean data from a single source or website, comprising all or a significant fraction of the users. The appeal of aggregate data for research is clear: not only are larger quantities better, aggregation avoids the bias problems of sampling. On the other hand, the privacy concerns are also clear: the fear is that the data will end up in the hands of the wrong people, such as one of the database marketing companies.
Aggregation is the most common of the privacy problems among the 7 examples I listed in my previous article. In some cases the original source made the data available and then backtracked, in other cases a third party crawled the data and got into trouble, and some were a mix of both.
For websites sitting on interesting data, an excellent compromise would be in-house data analysis (or perhaps a partnership program with outside researchers), as an alternative to making data public. OkCupid has been doing this extremely well, in my opinion — they have a great series of blog posts on race, looks and everything else that affects online dating. The man-hours spent on data analysis are well worth the increased pageviews and mindshare. Facebook has a data team as well, but given the quantity of data they have, they could be publishing quite a bit more.
5. Linkage
By linkage I refer to connecting the same person across multiple websites. Confusingly, this is sometimes referred to as aggregation. Linkage can take the form of database marketers connecting different databases of personal information, or in the online context, it can take the form of tools that link together individual profiles on different websites.
Pervasive online identities are becoming the norm, which is something I’ve been writing about. All of your online activities are going to be easily linkable sooner or later unless you explicitly take steps to keep your identities separate. But again, users haven’t quite woken up to this yet. Unwanted linkage is therefore something that can upset users greatly. The auto-connect feature in Google Buzz is the best example. Opt-in rather than opt-out is probably the way to go, at least for a few years until everyone gets used to it.
Summary. While well-understood access control principles tell us how to implement the privacy of data marked private, the privacy of “public” data is just as big a concern. So far there has been no systematic way of analyzing exactly what it is that users object to. In this article I’ve presented five such features. To avoid nasty surprises, developers building websites need to think carefully about privacy and user behavior when implementing any of these features.
Thanks to Ann Kilzer for reviewing a draft.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
5 comments July 6, 2010
Myths and Fallacies of “Personally Identifiable Information”
I have a new paper (PDF) with Vitaly Shmatikov in the June issue of the Communications of the ACM. We talk about the technical and legal meanings of “personally identifiable information” (PII) and argue that the term means next to nothing and must be greatly de-emphasized, if not abandoned, in order to have a meaningful discourse on data privacy. Here are the main points:
The notion of PII is found in two very different types of laws: data breach notification laws and information privacy laws. In the former, the spirit of the term is to encompass information that could be used for identity theft. We have absolutely no issue with the sense in which PII is used in this category of laws.
On the other hand, in laws and regulations aimed at protecting consumer privacy, the intent is to compel data trustees who want to share or sell data to scrub “PII” in a way that prevents the possibility of re-identification. As readers of this blog know, this is essentially impossible to do in a foolproof way without losing the utility of the data. Our paper elaborates on this and explains why “PII” has no technical meaning, given that virtually any non-trivial information can potentially be used for re-identification.
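A toy illustration of why stripping names is not enough: the Python sketch below counts how many records become unique as ordinary-looking attributes are combined. The records are fabricated for the example and the function name `unique_fraction` is an invention; the point is only the trend, not the specific numbers.

```python
from collections import Counter

# Fabricated records with no names or ID numbers in them.
RECORDS = [
    {"zip": "78701", "birth_year": 1980, "gender": "F"},
    {"zip": "78701", "birth_year": 1980, "gender": "M"},
    {"zip": "78701", "birth_year": 1975, "gender": "F"},
    {"zip": "78702", "birth_year": 1980, "gender": "F"},
    {"zip": "78702", "birth_year": 1980, "gender": "F"},
]


def unique_fraction(records, attributes):
    """Fraction of records whose combination of `attributes` is unique."""
    if not records:
        return 0.0
    keys = [tuple(record[name] for name in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


for attrs in (("zip",), ("zip", "birth_year"), ("zip", "birth_year", "gender")):
    print(attrs, unique_fraction(RECORDS, attrs))
```

Each added attribute splits the population into smaller groups, so anyone who knows those attributes about a person from another source can single out more and more records, even though no individual field looks “identifying.”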
What we are gunning for is the get-out-of-jail-free card, a.k.a. “safe harbor,” particularly in the HIPAA (health information privacy) context. In current practice, data owners can absolve themselves of responsibility by performing a syntactic “de-identification” of the data (although this isn’t the spirit of the law). Even your genome is not considered identifying!
Meaningful privacy protection is possible if account is taken of the specific types of computations that will be performed on the data (e.g., collaborative filtering, fraud detection, etc.). It is virtually impossible to guarantee privacy by considering the data alone, without carefully defining and analyzing its desired uses.
We are well aware of the burden that this imposes on data trustees, many of whom find even the current compliance requirements onerous. Often there is no one available who understands computer science or programming, and there is no budget to hire someone who does. That is certainly a conundrum, and it isn’t going to be fixed overnight. However, the current situation is a farce and needs to change.
Given that technologically sophisticated privacy protection mechanisms require a fair bit of expertise (although we hope that they will become commoditized in a few years), one possible way forward is by introducing stronger acceptable-use agreements. Such agreements would dictate what the collector or recipient of the data can and cannot do with it. They should be combined with some form of informed consent, where users (or, in the health care context, patients) acknowledge their understanding that there is a re-identification risk. But the law needs to change to pave the way for this more enlightened approach.
Thanks to Vitaly Shmatikov for comments on a draft of this post.
6 comments June 21, 2010
Conferences: The Good, the Bad and the Ugly aspects
I attended a couple of conferences this week that are outside my usual community. Taking stock of and interacting with a new crowd is always a very interesting experience.
The first was the IAPP Practical Privacy Series. The International Association of Privacy Professionals came about as a result of the fact that the Chief Privacy Officer (and equivalent) positions have suddenly emerged — over the last decade — and become ubiquitous. The role can be broadly described as “privacy compliance.” A big part of the initial impetus seems to have been HIPAA compliance, but the IAPP composition has now diversified greatly, because virtually every company is sitting on a pile of consumer data. There was even someone from Starbucks.
I spoke about anonymization. I was trying to answer the question, “I need to share/sell my data and you’re telling me that anonymization is broken. So what should I do?” It’s always a fun challenge to make computer science accessible to a non-tech audience (largely lawyers in this case). I think I managed reasonably well.
Next was the ACM Computers, Freedom and Privacy conference (which goes on until Friday). As I understand it, CFP was born at a time when “Cyberspace” was analogous to the Wild West, and there was a big need for self-governance and figuring out the emerging norms. The landscape is of course very different now, since the Internet isn’t a band of outlaws anymore but integrated into normal society. The conference has accordingly morphed somewhat, although a lot of the old crowd still definitely comes here.
The quality of the events I attended was highly variable. I checked out the “unconferences,” but only a couple had a meaningful level of participation and the one I went to seemed to devolve pretty quickly into a penis-waving contest. The session I liked best was a tutorial by Mike Godwin (of Godwin’s law, now counsel for the Wikimedia Foundation) on Cyberlaw, mainly First Amendment law.
CFP has parallel sessions. I had a great experience with that format at the Privacy Law Scholars Conference, but this time I’m not so sure — I’m regularly finding conflicts among the sessions I want to attend.
I’m bummed about the fact that there is really no mechanism for me to learn about conferences that are relevant to my interests but are outside my community. (I only learned about the IAPP workshop because I was invited to speak, and CFP purely coincidentally.) Do other researchers face this problem as well? I’m curious to hear about how people keep abreast. I mean, it’s 2010, and this is exactly the kind of problem that social media is supposed to be great at solving, but it’s not really working for me.
4 comments June 17, 2010
Facebook, Privacy, Public Opinion and Pitchforks
As just about everyone is already aware, Facebook has been up to a bunch of big brotherly stuff lately, including “instant personalization” — making your identity and data available to 3rd party sites you visit, arguing to treat ToS violations as criminal violations, and forcing you to make your “interests” public (or delete them). Overall, it looks like they’re making a bold move to take control of everyone’s identity and connections, privacy be damned.
The entirely predictable effect of this has been that everything the company now does is being viewed with extreme suspicion. The pitchforks have been sharpened, and the mob gets set off on almost any excuse. In the last week, one somewhat questionable feature, one minor bug and one utter non-event have each been reported as sinister privacy disasters:
- The questionable feature was linking your statuses to “connections” pages. The outrage was based on the meme “if your status contains the word FBI then the FBI will have a record of it,” which appears to have started here. That article is full of hyperbole and understandably appears to have been widely misunderstood to be claiming that even private statuses appear on Connection pages (they don’t). There’s really nothing new in terms of the visibility of your statuses: Facebook already had real-time search for public statuses, and the only difference is that someone can now click on the “FBI” page instead of having to type in “FBI” into the search box.
- The minor bug was that Facebook started listing Connect-enabled websites you visit in the “Applications” tab in your privacy settings. The sites didn’t get your identity or any of your data, nor did they have privileges to post to your wall. The fact that you visited them was not visible to anyone else. No actual harm was done. And yet an article titled Facebook’s new features secretly add apps to your profile alleged all of these things without making any real effort to check with Facebook. Facebook quickly fixed the bug and contacted the authors, and they updated the story, but it did little to quell the rumors, which took on a life of their own.
- The non-issue was Facebook leaking your IP address in email notifications. This is normal behavior: most webmail providers, except Gmail, put the sender’s IP into the message header as a spam-prevention technique. This kicked up another shitstorm.
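To see what the third item refers to, here is a small Python sketch that pulls IP addresses out of mail headers using the standard `email` module. The message is fabricated, with addresses from documentation-only ranges, and it assumes the sender's address shows up in a non-standard `X-Originating-IP` header or in a `Received` line, which is how some providers have recorded it; actual header names and formats vary by provider.

```python
import re
from email import message_from_string

# Fabricated message; all hosts and addresses are placeholders.
RAW_MESSAGE = """\
Received: from mail.example.net (mail.example.net [198.51.100.7])
\tby mx.example.org with ESMTP; Mon, 10 May 2010 10:00:00 -0700
X-Originating-IP: [203.0.113.25]
From: alice@example.net
To: bob@example.org
Subject: Hello

Body text.
"""

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def ips_in_headers(raw, header_names=("X-Originating-IP", "Received")):
    """Return IPv4-looking strings found in the named headers, in order."""
    message = message_from_string(raw)
    found = []
    for name in header_names:
        for value in message.get_all(name, []):
            found.extend(IPV4.findall(value))
    return found


print(ips_in_headers(RAW_MESSAGE))
```

Anyone who receives the message can read these headers, which is why their presence in a notification email is ordinary mail behavior rather than a leak specific to one site.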
In spite of these unfair accusations, it is hard for me to feel any sympathy for the beleaguered company. This is how public opinion works, and they can’t claim not to have seen it coming. As this fantastic visualization by Matt McKeon shows, Facebook has been on a long and consistent path to make all of your information public, essentially pulling a giant bait-and-switch on their users. They stepped up the pace recently, asked their users to give up too much too fast, and something just snapped.
I think Facebook underestimated the extent to which privacy correlates with trust. They were forgiven for Beacon and other problems in the past, but after the most recent series of privacy violations, it became clear that these were not missteps but deliberate actions. I believe that Facebook’s relationship with its users has changed fundamentally, and isn’t going to mend any time soon. Perhaps Facebook’s reckoning is that they are now big enough that it doesn’t matter any more. That remains to be seen.
On a personal note, someone pretty high up at Facebook emailed me a couple of months ago (although “not in an official capacity”) to have a discussion about privacy issues with some of their upcoming product launches. Unfortunately I was traveling at the time, and when I got back they were no longer interested. I guess by then it was too close to f8 and all the important decisions had been made. I can’t help wondering if the outcome might have been different if I’d been able to meet with them — perhaps they might have eased off just a little bit on their world-domination plans and avoided the straw that broke the camel’s back. But I suspect that that’s just wishful thinking, given that the imperative for their current push in all likelihood came from the very top.
5 comments May 10, 2010
Is Making Public Data “More Public” a Privacy Violation?
What on earth does more public mean? Technologists draw a simple distinction between data that is public and data that is not. Under this view, the notion of making data more public is meaningless. But common sense tells us otherwise: it’s hard to explain the opposition to public surveillance if you assume that it’s OK to collect, store and use “public” information indiscriminately.
There are entire philosophical theories devoted to understanding what one can and cannot do with public data in different contexts. Recently, danah boyd argued in her SXSW keynote in support of “privacy through obscurity” and how technology is destroying this comfort. According to boyd, most public data is “quasi-public” and technologists don’t have the right to “publicize” it.
Some examples. One can debate the point in the abstract, but there is no question that companies and individuals have repeatedly been bitten when applying the “it’s already public” rule. Let’s look at some examples (the list and the discussion are largely concerned with data on the web).
- The availability of the California Birth Index on the web caused considerable consternation about a decade ago, despite the fact that birth records in the state are public and anyone’s birth record can be obtained through official channels albeit in a cumbersome manner.
- IRSeek planned to launch a search engine for IRC in 2007 by monitoring and indexing public channels (chatrooms). There was a predictable privacy outcry and they were forced to shut down.

- The Infochimps guys crawled the Twitter graph back in 2008 and posted it on their site. Twitter forced them to take the dataset down.
- The story was repeated with Pete Warden and Facebook; this time it was nastier and involved the threat of a lawsuit.
- MySpace recently started selling user data in bulk on Infochimps. As MySpace has pointed out, the data is already public, but privacy concerns have nevertheless been raised.
- One reason for the backlash against Google Buzz was auto-connect: it connected your activity on Google Reader and other services and streamed it to your friends. Your Google Reader activities were already public, but Buzz took it further by broadcasting it.
- Spokeo is facing similar criticism. As Snopes explains, “Spokeo displays listings that sometimes contain more personal information than many people are comfortable having made publicly accessible through a single, easy-to-use search site.”
The last four examples are all from the last couple of months. For some reason the issue has suddenly started cropping up all the time. The current situation is bad for everyone: data trustees and data analysts have no clear guidelines in place, and users/consumers are in a position of constantly having to fight back against a loss of privacy. We need to figure out some ground rules to decide what uses of public data on the web are acceptable.
Why not “none?” I don’t agree with a blanket argument against using data for purposes other than originally intended, for many reasons. The first is that users’ privacy expectations, when they go beyond the public/private dichotomy, are generally poorly articulated, frequently unreasonable and occasionally self-contradictory. (An unfortunate but inevitable consequence of the complexity of technology.) The second reason is that these complex privacy rules, even if they can be figured out, often need to be communicated to the machine.
The third reason is the “greater good.” I’ve opposed that line of reasoning when used to justify reneging on an explicit privacy promise. But when it comes to a promise that was never actually made but merely intuitively understood (or mis-understood) by users, I think the question is different, and my stance is softer. Privacy needs to be weighed against the benefit to society from “publicizing” data — disseminating, aggregating and analyzing it.
In the next article of this series, I will give a rigorous technical characterization of what constitutes publicizing data. My hope is that this will go a long way towards determining what is and is not a violation of privacy. In the meanwhile, I look forward to hearing different opinions.
Thanks to Pete Warden and Vimal Jeyakumar for comments on a draft.
13 comments April 5, 2010
An open letter to Netflix from the authors of the de-anonymization paper
Dear Netflix,
Today is a sad day. It is also a day of hope.
It is a sad day because the second Netflix challenge had to be cancelled. We never thought it would come to this. One of us has publicly referred to the dampening of research as the “worst possible outcome” of privacy studies. As researchers, we are true believers in the power of data to benefit mankind.
We published the initial draft of our de-anonymization study just two weeks after the dataset for the first Netflix Prize became public. Since we had the math to back up our claims, we assumed that lessons would be learned, and that if there were to be a second data release, it would either involve only customers who opted in, or a privacy-preserving data analysis mechanism. That was three and a half years ago.
Instead, you brushed off our claims, calling them “absolutely without merit,” among other things. It has taken negative publicity and an FTC investigation to stop things from getting worse. Some may make the argument that even if the privacy of some of your customers is violated, the benefit to mankind outweighs it, but the “greater good” argument is a very dangerous one. And so here we are.
We were pleasantly surprised to read the plain, unobfuscated language in the blog post announcing the cancellation of the second contest. We hope that this signals a change in your outlook with respect to privacy. We are happy to see that you plan to “continue to explore ways to collaborate with the research community.”
Running something like the Netflix Prize competition without compromising privacy is a hard problem, and you need the help of privacy researchers to do it right. Fortunately, there has been a great deal of research on “differential privacy,” some of it specific to recommender systems. But there are practical challenges, and overcoming them will likely require setting up an online system for data analysis rather than an “anonymize and release” approach.
Data privacy researchers will be happy to work with you rather than against you. We believe that this can be a mutually beneficial collaboration. We need someone with actual data and an actual data-mining goal in order to validate our ideas. You will be able to move forward with the next competition, and just as importantly, it will enable you to become a leader in privacy-preserving data analysis. One potential outcome could be an enterprise-ready system which would be useful to any company or organization that outsources analysis of sensitive customer data.
It’s not often that a moral imperative aligns with business incentives. We hope that you will take advantage of this opportunity.
Arvind Narayanan and Vitaly Shmatikov
For background, see our paper and FAQ.
To stay on top of future posts on 33bits.org, subscribe to the RSS feed or follow me on Twitter.
19 comments March 15, 2010
Data Privacy: The Story of a Paradigm Shift
Let’s take a break from the Ubercookies series. I’m at the IPAM data privacy workshop in LA, and I want to tell you about the kind of unusual scientific endeavor that it represents. I’ve recently started to write about the process of doing science, what’s good and what’s bad about it, and I expect to have more to say on this topic in this blog.
While “paradigm shift” has become a buzzword, the original sense in which Kuhn used it refers to a specific scientific process. I’ve had the rare experience of witnessing such a paradigm shift unfold, and I may even have played a small part. I am going to tell that story. I hope it will give you a “behind-the-scenes” look into how science works.
I will sidestep the question of whether data privacy is a science. I think it is a science to the extent that computer science is a science. At any rate, I think this narrative provides a nice illustration of Kuhn’s ideas.
First I need to spend some time setting up the scene and the actors. (I’m going to take some liberties and simplify things for the benefit of the broader audience, and I hope my colleagues will forgive me for it.)
The scene. Privacy research is incredibly multidisciplinary, and this workshop represents one extreme of the spectrum: the math behind data privacy. The mathematical study of privacy in databases centers on one question:
If you have a bunch of data collected from individuals, and you want to let other people do something useful with the data, such as learning correlations, how do you do it without revealing individual information?
There are roughly 3 groups that investigate this question and are represented here:
- computer scientists with a background in cryptography / theoretical CS
- computer scientists with a background in databases and data mining
- statisticians.
This classification is neither exhaustive nor strict, but it will suffice for my current purposes.
One of the problems with science and math research is that different communities studying different aspects of the same problem (or even studying the same problem from different perspectives) don’t meet together very often. For one, there is a good deal of friction in overcoming the language barriers (different names/ways of thinking about the same things). For another, academics are rewarded primarily for publishing in their own communities. That is why the organizers deserve a ton of credit for bridging the barriers and getting people together.
The paradigms. There is a fundamental, inescapable tension between the utility of data and the privacy of the participants. That’s the one thing that theorists and practitioners can agree on.
Given that fact, there are two approaches to go about building a theory of privacy-protection, which I will call utility-first and privacy-first. Statisticians and database people tend to prefer the former paradigm, and cryptographers the latter; but this is not a clean division.
Utility-first hopes to be able to preserve the statistical computations that we would want to do if we didn’t have to worry about privacy, and then ask, “how can we improve the privacy of participants while still doing all these things?” Data anonymization is one natural technique that comes out of this world view: if you are only doing simple syntactic transformations to the data, the utility of the data is not affected very much.
On the other hand, privacy-first says, “let’s first figure out a rigorously provable way to assure the privacy of participants, and then go about figuring out what are the types of computations that can be carried out under this rubric.” The community has collectively decided, with good reason, that differential privacy is the right rubric to use. To explain it properly would require many Greek symbols, so I won’t.
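Even without the symbols, a small sketch can convey the flavor. Informally, a randomized computation is differentially private with parameter epsilon if adding or removing any one person's record changes the probability of any outcome by at most a factor of e^epsilon. The Python below shows the Laplace mechanism, a standard construction from the differential privacy literature, applied to a counting query. The function names and the sample data are invented for this illustration, and a production system would need far more care, for example around floating-point behavior and tracking repeated queries.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of records matching `predicate`.

    One person's record changes a count by at most 1, so Laplace noise
    with scale 1/epsilon is enough for this single query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


ages = [23, 35, 41, 29, 62, 57, 38, 44]  # made-up data
rng = random.Random()
print(private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng))
```

Smaller values of epsilon mean more noise and stronger protection. The answer is useful in aggregate while the presence or absence of any single record is masked, which is the privacy-first bargain in miniature.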
Privacy-first and utility-first are scientific paradigms, not theories. Neither is falsifiable. We can say that one is better, but that is a judgement.
An important caveat must be noted here. The terms do not refer to the social values of putting the utility of the data before the privacy of the participants, or vice versa. Those values are external to the model and are constraints enforced by reality. Instead, we are merely talking about which paradigm gives us better analytical techniques to achieve both the utility and privacy requirements to the extent possible.
The shift. With utility-first, you have strong, well-understood guarantees on the usefulness of the data, but typically only a heuristic analysis of privacy. What this translates to is an upper bound on privacy. With privacy-first, you have strong, well-understood privacy guarantees, but you only know how to perform certain types of computations on the data. So you have a lower bound on utility.
That’s where things get interesting. Utility-first starts to look worse as time goes on, as we discover more and more inferential techniques for breaching the privacy of participants. Privacy-first starts to look better with time, as we discover that more and more types of data-mining can be carried out due to innovative algorithms. And that is exactly how things have played out over the last few years.
I was at a similarly themed workshop at Bertinoro, Italy back in 2005, with much the same audience in attendance. Back then, the two views were about equally prevalent; the first papers on differential privacy were being written or had just been written (of course, the paradigm itself was not new). Fast forward 5 years, and the proponents of one view have started to win over the other, although we quibble to no small extent over the details. Overall, though, the shift has happened in a swift and amicable way, with both sides now largely agreeing on differential privacy.
Why did privacy-first win? I can see many reasons. The privacy protections of the utility-first techniques kept getting broken (a Kuhnian “crisis”?); the de-anonymization research that I and others worked on played a big part here. Another reason might be the way the cryptographic community operates: once they decide that a paradigm is worth investigating, they tend to jump in on it all at once and pick the bones clean. That ensured that within a few years, a huge number of results of the form “how to compute X with differential privacy” were published. A third reason might very well be the fact that these interdisciplinary workshops exist, giving us an opportunity to change each other’s minds.
The fallout. While the debate in theoretical circles seems largely over, the ripple effects are going to be felt “downstream” for a long time to come. Differential privacy is only slowly penetrating other areas of research where privacy is a peripheral but not a fundamental object of study. As for law and policy, Ohm’s paper on the failure of anonymization has certainly created a bang there.
That leaves the most important contingent: practitioners. Technology companies have been quick to learn the lessons (differential privacy was invented by Microsoft researchers) and have been studying questions like sharing search logs with differential privacy assurances and building programming systems incorporating differential privacy (see PINQ, developed at Microsoft Research, and Airavat, funded by Google).
Other sectors, especially medical informatics, have been far slower to adapt, and it is not clear if they ever will. Multiple speakers at this workshop dealing with applications in different sectors talked about their efforts at anonymizing high-dimensional data (good luck with that). The problems are compounded by the fact that differential privacy isn’t yet at a point where it is easily usable in applications and in many cases the upshot of the theory has been to prove that the simultaneous utility and privacy requirements simply cannot be met. It will probably be the better part of a decade before differential privacy starts to make any real headway into real-world usage.
Summary. I hope I’ve shown you what scientific “paradigms” are, how they are adopted and discarded. Paradigm shifts are important turning points for scientific disciplines and often have big consequences for society as a whole. Finally, science is not a cold sequence of deductions but is done by real people with real motivations; the scientific process has a significant social and cultural component, even if the output of science is objective.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
6 comments February 25, 2010
How Google Docs Leaks Your Identity
Recap. In the previous two articles in this Ubercookies series, I showed how an arbitrary website that you visit can learn your identity using the “history stealing” bug in web browsers. In this article I will show how a bug in Google Docs gives any website the same capability in a far easier manner.
Update. A Google Docs team member tells me that a fix should be live later today.
Update 2. Now fixed.
About six weeks ago I discovered that a feature/bug in Google Docs can be used to mass harvest e-mail addresses. I noted it in my journal, but soon afterwards I realized that it was much worse: you could actually discover the identity of web visitors using the bug. Recently, Vincent Toubiana and I implemented the attack; here is a video of the demo webpage (on my domain, in no way related to Google) just to show that we got it working.
(You might need to hit pause to read the text.)
I’m not releasing the live demo, since the vulnerability unfortunately still exists (more on this below). Let us now study the attack in more detail.

Bug or feature? Google Spreadsheets has a feature that tells you who else is editing the document. It’s actually really nifty: you can see in real time who is editing which cell, and it even seems to have live chat. The problem is that this feature is available even for publicly viewable documents. Do you see where this is going?
First of all, this is a problem even without the surreptitious use I’m going to describe. Here’s a public spreadsheet I found with 10 seconds of Googling that a few people seemed to be viewing when I looked. I’m not sure the author of this document intended it to be publicly viewable or editable.
The attack works by embedding an invisible iframe (dimensions 0×0) into the malicious web page. The iframe loads a public spreadsheet that the attacker has already created. In a separate backend process, the attacker constantly checks the list of people viewing the spreadsheet and records this information. After the iframe is embedded, the JavaScript on the page waits a second or two and queries the attacker’s server to get the username of the user who most recently appeared on the list.
What if multiple people are visiting the page at roughly the same time? It’s not a problem, for two reasons: 1. Google Spreadsheets has a “push” notification system for updating the frontend which enables the attacker to get the identity of the new user virtually instantaneously. 2. To further increase accuracy, the attacker can create (say) 10 spreadsheets and embed a random subset of 5 into any given visitor’s page, making it exceptionally unlikely that there will be a collision.
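The arithmetic behind the second point can be checked directly. The sketch below assumes the figures from the example (a pool of 10 spreadsheets, 5 embedded per visitor), that each visitor's subset is drawn uniformly at random, and that a "collision" means two simultaneous visitors receiving exactly the same subset. The function names are illustrative only.

```python
from math import comb


def subset_count(pool_size: int, embedded: int) -> int:
    """Number of distinct subsets that could be embedded in a visitor's page."""
    return comb(pool_size, embedded)


def collision_probability(pool_size: int, embedded: int, concurrent_visitors: int) -> float:
    """Probability that at least two concurrent visitors get the same subset.

    This is a birthday-style calculation over uniformly random subsets.
    """
    n = subset_count(pool_size, embedded)
    p_all_distinct = 1.0
    for i in range(concurrent_visitors):
        p_all_distinct *= max(n - i, 0) / n
    return 1.0 - p_all_distinct


print(subset_count(10, 5))                         # 252 possible subsets
print(round(collision_probability(10, 5, 2), 4))   # 0.004, i.e. 1 in 252
```

Under these assumptions, two visitors arriving in the same instant would be confused with each other only about once in 252 cases, and a larger pool shrinks that chance further.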
The only inefficient part of the attack as Toubiana and I have implemented it is that it requires a browser (with a GUI) to be open to monitor the spreadsheet. Browser rendering engines have been modularized into scriptable components, so with a little more effort it should be possible to run this without a display. At present I have it running out of an old laptop tucked away in my dresser.
The “backend” is in there
Defense. How can Google fix this bug? There are stop-gap measures, but as far as I can see the only real solution is to disable the collaborator list for public documents. Again, this is a trade-off between functionality and privacy, as we saw in the previous article.
Many people responded to my original post saying they were going to stay logged out of Google when they didn’t need to be logged in (since you can’t log out of just Google Docs separately). Unfortunately, that’s not a feasible solution for me, nor, I suspect, for many other people. There are at least 3 Google services that I constantly need to keep tabs on; otherwise my entire workflow would come to a screeching halt. So I just have to wait for Google to do something about this bug. Which brings me to my next point:
Great power, great responsibility. There is a huge commercial benefit to becoming an identity provider. As Michael Arrington has repeatedly noted, many Internet companies issue OpenIDs but don’t accept them from other providers, in a race to “own the identity” of as many users as possible. That is of course business as usual, but the players in this race need to wake up to the fact that being an identity provider is asking users for a great deal of trust, whether or not users realize it.
An identity-stealing bug is an (unintentional) violation of that trust because — among many other reasons — it is a precursor to stealing your actual account credentials. (That is particularly scary with Google due to their lack of anything resembling customer service for account issues.) One strategy for stealing account credentials is a phishing page mimicking the Google login page, with your username filled in. Users are much less likely to be suspicious and more likely to respond to messages that have their name on them. Research on social phishing reaches similar conclusions.
I’ve been in contact with people at Google about this bug and I’ve been told a fix is being worked on, specifically that “less presence information will be revealed.” I take it to mean the attack described here won’t work. Since they are making a good-faith effort to fix it, I’m not releasing the demo itself. It has been a long time, though. The Buzz privacy issues were fixed in 4 days, and that kind of urgency is necessary for security issues of this magnitude.
A kind of request forgery. The attack here can be seen as a simpleminded cross-site request forgery. In general, any type of request forgery bug that causes your browser to initiate a publicly recorded interaction on your behalf will immediately leak your identity. For example, if (hypothetically) visiting a URL causes your browser to leave a comment on a specific Youtube video, then the attacker can create a Youtube video and constantly monitor it for comments, mirroring the attack technique used here.
Another technical lesson from this bug is that access control in social networking can be tricky. I’ve written before that privacy in social networking is about a lot more than access control, and that theory doesn’t help determine user reactions to your product. But this bug was an access control issue, and theory would have helped. Websites designing social features would do well to have someone with an academic background thinking about security issues.
Up next. In this post as well as the previous ones, I’ve briefly hinted at what exactly can go wrong if websites can learn your identity. The next post in this series will examine that issue in more detail. Stay tuned — it turns out there’s quite a bit more to say about that, and you might be surprised.
Thanks to Vincent Toubiana for reviewing a draft.
8 comments February 22, 2010
Ubercookies Part 2: History Stealing meets the Social Web
Recap. In the previous article I introduced ubercookies — techniques that websites can use to de-anonymize visitors. I discussed a recent paper that shows how to use history stealing along with social network group membership information to find the visitor’s identity, and I promised a stronger variant of the attack.
The observation that led me to the attack I’m going to describe is simple: social networking isn’t just about social networks; the whole web has gone social. It’s a view that you quickly internalize if you spend any time hanging out with Silicon Valley web entrepreneurs.
Let’s break the underlying principle of the identity-stealing attack down to its essence:
A user leaves a footprint whenever their interaction with a specific web page is recorded publicly.
De-anonymization happens when the attacker can tie these footprints together into “trails” that can then be correlated with the user’s browser history. Efficiently querying the history to identify multiple points on the trail is a challenging problem to solve, but in principle de-anonymization is possible as long as the user’s actions on different web pages happen under the same identity.
Footprints can be tied together into trails as long as all the interactions happen under the same identity. There is no need for the interactions to be on the same website.
There are two major ways in which you can interact with arbitrary websites under a unified identity, both of which are defining principles of the social web. The first is federated identity, which means you can use the same identity provider wherever you go. This is achieved through OpenID and similar mechanisms like Facebook Connect. The second is social sharing: whenever you find something interesting anywhere on the web, you feed it back to your social network.
Now let’s examine the different types of interactions in more detail.
A taxonomy of interaction on the social web.
0. The pre-social web had no social networks and no delegated identity mechanism (except for the failed attempt by Microsoft called Passport). Users created new identities on each website, authenticated via site-specific usernames and passwords to each site separately. The footprints on different sites cannot be tied together; for practical purposes there are no footprints.
1. Social networks: affiliation. In social networks, users interact with social objects and leave footprints when the actions are public. The key type of interaction that is useful for de-anonymization is the expression of affiliation: this covers not just the group memberships studied in the recent Wondracek et al. paper, but also
- memberships of fan pages on Facebook
- “interests” on Livejournal
- follow relationships and plain old friend relationships on Twitter and other public social networks
- subscriptions to Youtube channels
and so on.
All of these interactions, albeit very different from the user perspective, are fundamentally the same concept:
- you “add yourself” to or affiliate yourself with some object on a social network
- this action can be publicly observed
- you almost certainly visited a URL that identifies the object before adding it.
2. The social web: sharing. When you find a page you like — any page at all — you can import it or “share” it to your social stream, on Facebook, Twitter, Google Buzz, or a social bookmarking site like Delicious. The URL of the page is almost certainly in your history, and as long as your social stream is public, your interaction was recorded publicly.
3. The social web: federated identity. When you’re reading a blog post or article on the social web, you can typically comment on it, “like” it, favorite it, rate it, etc. You do all this under your Facebook, Google or other unified identity. These actions are often public and when they are, your footprint is left on the page.
A taxonomy of attacks
The three types of social interactions above give rise to a neat taxonomy of attacks. They involve progressively easier backend processing and progressively more sophisticated history search techniques on the front end. But the execution time on the front-end doesn’t increase, so it is a net win. Here’s a table:
| Type of interaction | Backend processing | Type of history URL | Location of footprint |
| Affiliation | Crawling of social network | Object in a social network | In the social network |
| Sharing | Syndication of social stream(s) from social network | Any page | In the social network |
| Federated identity | None; optional crawling | Any page | On the page |
1. Better use of affiliation information. The Wondracek et al. paper makes use of only group membership. One natural reason to choose groups is that there are many groups that are large, with thousands of members, so it gives us a reasonably high chance that by throwing darts in the browser history we will actually hit a few groups that the user has visited. On the other hand, if we try to use the Facebook friend list, for example, hoping to find one of the user’s friends by random chance, it probably won’t work because most users have only a few hundred friends.
But wait: many Twitter users have thousands or even millions of followers. These are known as “hubs” in network theory. Clearly, the attack will work for any kind of hubs that have predictable URLs, and users on Twitter have even more predictable URLs (twitter.com/username) than groups on various networks. The attack will also work using Youtube favorites (which show up by default on the user’s public profile or channel page) and whatever other types of affiliation we might choose to exploit, as long as there are “hubs” — nodes in the graph with high degree. Already we can see that many more websites are vulnerable than the authors envisaged.
2. Syndicating the social stream: my Delicious experiment.
The interesting thing about the social stream is that you can syndicate the stream of interactions, rather than crawling. The reasons why syndication is much easier than crawling are more practical than theoretical. First, syndicated data is intended to be machine readable, and is therefore smaller as well as easier to parse compared to scraping web pages. Second, and more importantly, you might be able to get a feed of the entire site-wide activity instead of syndicating each user’s activity stream separately. Delicious allows global syndication; Twitter plans to open this “firehose” feature to all developers soon.
Another advantage of the social stream is that everything is timestamped, so you can limit yourself to recent interactions, which are more likely to be in the user’s history.
Using the delicious.com dataset made available by DAI-labor (a log of all bookmarking activity on delicious.com over several years), I did a simulated experiment using 3 months’ worth of data: assuming that users keep their history around for 3 months and do in fact visit every link they post on Delicious, how many users would a hypothetical history stealing attack be able to identify? I had a pretty good success rate: about 60% of the users who had shared at least 2 links in the 3-month period, or about 300,000 users. This takes at most 4000-5000 JavaScript history queries.
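To make the matching step of such a simulation concrete, here is a toy sketch. It is not the code used in the experiment. It assumes the bookmark log has already been reduced to (user, URL) pairs, that a simulated "history" is simply a set of URLs, and that every shared link is visited. All names and sample data are hypothetical.

```python
from collections import defaultdict


def links_by_user(bookmarks):
    """Group a log of (user, url) pairs into the set of URLs each user shared."""
    shared = defaultdict(set)
    for user, url in bookmarks:
        shared[user].add(url)
    return shared


def matching_users(shared, history, min_links=2):
    """Users with at least min_links shared links, all of which appear in the history.

    Under the assumption that users visit every link they post, the true
    user is always in this set; they are identified when the set has size 1.
    """
    return {
        user
        for user, urls in shared.items()
        if len(urls) >= min_links and urls <= history
    }


# Hypothetical sample data.
log = [
    ("alice", "http://a.example/1"),
    ("alice", "http://b.example/2"),
    ("bob", "http://a.example/1"),
    ("bob", "http://c.example/3"),
    ("carol", "http://d.example/4"),
]
history = {"http://a.example/1", "http://b.example/2", "http://news.example/"}

print(matching_users(links_by_user(log), history))  # {'alice'}
```

This sketch checks whole sets at once for simplicity. It says nothing about how to keep the number of history queries low, which is where the 4000-5000 figure comes from.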
Needless to say, once Twitter opens up its firehose, Twitter users (who are far more numerous than delicious users) would also be susceptible to the same technique.
This attack cannot be fixed via server-side URL randomization. It can also be made to work using Facebook, Google Buzz, and other sharing platforms, although the backend processing required won’t be as trivial (but probably no harder than in the original attack).
3. A somewhat random walk through the history park.
And now for an approach that potentially requires no backend data collection, although it is speculative and I can’t guess what the success rate would be. The attack proceeds in several steps:
- Identify the user’s interests by testing if they’ve visited various popular topic-specific sites. Pick one of the user’s favorite topics. Incidentally, a commenter on my previous post notes he is building exactly this capability using topic pages on Wikipedia, also with the goal of de-anonymization!
- Grab a list of the top blogs on the topic you picked from one of the blog directories. Query the history to see which of these blogs the user reads frequently. It is even possible to estimate the level of interest in a blog by looking at the fraction of the top/recent posts from that blog that the user has visited. Pick a blog that the user seems to visit regularly.
- Look for evidence of the user leaving comments on posts. For example, on Blogger, the comment page for a post has the URL http://www.blogger.com/comment.g?blogID=<blogid>&postID=<postid>.
- Once you find a couple of posts where it looks like the user made a comment, scrape the list of people who commented on each and find the intersection. (Even a single comment might suffice; as long as you have a list of candidates, you can easily verify if it’s one of them by testing user-specific URLs. More below.)
- Depending on the blogging platform, you might even be able to deduce that the user responded (or intended to respond) to a specific comment. For example, on WordPress you have the pattern http://<blogname>.wordpress.com/<postname>/?replytocom=<commentid>#respond. If you get lucky and find one of those patterns, that makes things even easier.
If at first you don’t succeed, pick a different blog and repeat.
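The bookkeeping in the steps above is simple enough to sketch. The example below is only an illustration of three pieces: the interest estimate, the Blogger comment URL pattern quoted in the list, and the intersection of commenter lists. It assumes the history query results are already available as a plain set of visited URLs and that commenter lists have already been collected; the helper names and sample values are hypothetical.

```python
from urllib.parse import urlencode


def interest_fraction(recent_posts, visited):
    """Fraction of a blog's recent post URLs that appear in the visited set."""
    if not recent_posts:
        return 0.0
    return sum(1 for url in recent_posts if url in visited) / len(recent_posts)


def blogger_comment_url(blog_id, post_id):
    """Fill in the Blogger comment-page pattern quoted in the text."""
    query = urlencode({"blogID": blog_id, "postID": post_id})
    return "http://www.blogger.com/comment.g?" + query


def common_commenters(commenters_by_post):
    """Intersect the commenter lists of the posts where a comment URL was found."""
    lists = [set(names) for names in commenters_by_post.values()]
    return set.intersection(*lists) if lists else set()


# Hypothetical sample values.
print(interest_fraction(["p1", "p2", "p3", "p4"], {"p1", "p3"}))  # 0.5
print(blogger_comment_url("123", "456"))
print(common_commenters({"post-a": ["ann", "raj", "lee"], "post-b": ["raj", "mia"]}))  # {'raj'}
```

With a single post the intersection is just that post's commenter list, which matches the remark that one comment might suffice as long as the candidates can then be verified individually.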
I suspect that the most practical method would be to use a syndicated activity stream from a social network, but also to use the heuristics presented above to more efficiently search through the history.
Epilogue: Identity.
Not only has there been a movement towards a small number of identity providers on the web, there are many aggregators out there that have sprung up in order to automatically find the connections between identities across the different identity providers, and also connect online identities to physical-world databases. As Pete Warden notes:
One of the least-understood developments of the last few years is the growth of databases of personal information linked to email addresses. Rapleaf is probably the leader in this field, but even Flickr lets companies search their API for users based on an email address.
I ran my email address through his demo script and it is quite clear that virtually all of my online identities have been linked together. This is getting to be the norm; as a consequence, once an attacker gets any kind of handle on you, they can go “identity hopping” and find out a whole lot more about you.
This is also the reason that once the attacker can make a reasonable guess at the visitor’s identity, it’s easy to verify the guess. Not only can they look for user-specific URLs in your history to confirm the guess (described in detail in the Wondracek et al. paper), but all your social streams on other sites can also be combined with your history to corroborate your identity.
Up next in the Ubercookies series: So that’s pretty bad. But it’s going to get worse before it can get better.
In the next article, I will describe an entirely different attack strategy to get at your identity by exploiting a bug in a specific identity provider’s platform.
3 comments February 19, 2010
Privacy is not Access Control (But then what is it?)
In my previous article on the Google Buzz fiasco, I pointed out that the privacy problems were exacerbated by the fact that the user interface was created by programmers. In this post I will elaborate on that theme and provide some constructive advice on privacy-conscious design, especially for social networking.
The problem I’m addressing is that as far as computer scientists and computer programmers are concerned, privacy is a question of access control, i.e., who is allowed to look at what. Unfortunately, in the real world, that is only a tiny part of what privacy is about. Here are three examples to make my point:
1. Dummy cameras. Consider a thought experiment: suppose the government installed a bunch of cameras all over a public park along with prominent signs announcing 24×7 surveillance. The catch, however, is that the cameras have not been turned on. Has anyone’s privacy been violated?
From the computer science perspective, the answer is no, because no one is actually being observed, nothing is being recorded and no data is being generated. But common sense tells us that something is wrong with that answer. The cameras cause people considerable discomfort. The surveillance, real or imaginary, changes their behavior.
This hypothetical scenario is adapted from Ryan Calo’s paper, which analyzes in detail the “sensation of being observed.”
2. Aggregation changes the equation. Remember the uproar when Facebook released News Feed? No new information was revealed to your friends that wasn’t accessible to them before; it was just that the News Feed made it dramatically easier to observe all your activities on the site.
Of course, it goes both ways: the technology in turn changed people’s expectations; it is now hard to imagine not having a feed-like system, whether on Facebook or another social network. Nevertheless, I often see people putting something into their profile, deciding a few moments later that they didn’t want to share it after all, and realizing that it was too late because the information has already been broadcast to their friends.
3. Everyone-but-X access control, which I described in an earlier article, shows in a direct way how access control fails to capture privacy requirements. From the traditional CS security perspective, the ability for a user to make something visible to “everyone but X” is meaningless: X can always create a fake account to get around it.
But a use-case should hopefully immediately convince you that everyone-but-X is a good idea: your sibling is on your friends list and you want to post about your sex life. It’s not that you want to prevent X from having access to your post, but rather that both of you prefer that X didn’t have access to it.
Access control is not the goal of privacy design. It is at best one of many tools. Rather, human behavior is key. The dummy cameras were bad because they affected the behavior of people in a detrimental way. News feed was bad because it introduced major new privacy consequences for the behaviors that people were accustomed to on the site. (However, I would argue that the dramatic increase in usefulness trumped the privacy drawbacks.) Everyone-but-X privacy is good because it allows people to carry over to the online setting behaviors that they are used to in the real world.
It is impossible to fully analyze the privacy consequences of a design decision without studying its impact on actual user behavior. There is no theoretical framework to ensure that a design decision is safe — user testing is essential. Going back to Google Buzz, a beta period or a more gradually phased roll-out would have undoubtedly been better.
8 comments February 13, 2010




