What Every Developer Needs to Know About “Public” Data and Privacy
July 6, 2010 at 7:43 pm
It is natural for developers building web applications to operate under a public/private dichotomy, the assumption being that if a user made a piece of data public, then they’ve given up any privacy expectation. But as we saw in a previous article, users often expect more subtle distinctions, and many unfortunate privacy blunders have resulted. To avoid repeats of these, engineers need to be able to reason about the privacy implications of specific technical features. This article presents a set of criteria for doing so.
Computers are designed to keep data around forever unless it is explicitly deleted. But this default makes many nontechnical people deeply uncomfortable. There have been a number of proposals to “make the Internet forget,” bringing it in line with humans’ anthropomorphic expectations. While these broad proposals will probably not amount to much, there need to be some controls on archiving, especially by third parties. The following examples illustrate why this is important:
- A woman was recently fired from her job because her employer found some of her online revelations objectionable. She got caught because Topsy, a Twitter search engine, retained her personal data in its cache even after she had deleted it from Twitter.
- Joe Bonneau revealed that many photo-sharing sites still fail to delete photos from their CDN caches, a full year after the vulnerability was first made public and received media attention.
The examples above show a clear need for a standard for machine-readable third-party data retention policy — a robots.txt on steroids, if you will. Pete Warden proposed expanding robots.txt a few months ago; now that multiple sites are facing this problem, perhaps there will be some momentum in this direction.
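To make the idea concrete, here is a minimal sketch of how a crawler might read and obey such a policy. The directive names `Cache-Max-Age` and `Honor-Deletes` are hypothetical and invented purely for illustration; they are not part of robots.txt or of any existing proposal, and the helper names are assumptions too.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetentionPolicy:
    # Hypothetical fields; no real standard defines them.
    max_age_seconds: Optional[int] = None  # None: the site stated no limit
    honor_deletes: bool = False            # drop copies deleted at the source?


def parse_retention_policy(text: str) -> RetentionPolicy:
    """Parse robots.txt-style 'Key: value' lines, ignoring unknown keys."""
    policy = RetentionPolicy()
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # strip trailing comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "cache-max-age":
            policy.max_age_seconds = int(value)
        elif key == "honor-deletes":
            policy.honor_deletes = value.lower() == "yes"
    return policy


def may_keep(policy: RetentionPolicy, age_seconds: int,
             deleted_at_source: bool) -> bool:
    """Decide whether a third party may keep its cached copy of an item."""
    if deleted_at_source and policy.honor_deletes:
        return False
    if policy.max_age_seconds is not None and age_seconds > policy.max_age_seconds:
        return False
    return True


example = "User-agent: *\nCache-Max-Age: 3600  # one hour\nHonor-Deletes: yes\n"
policy = parse_retention_policy(example)
```

Under a policy like this, a search engine in Topsy's position would be expected to drop its cached copy as soon as it learned that the item had been deleted at the source.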
The real-time web relies on “pushing” updates to clients instead of the traditional model of crawling. The push model greatly improves timeliness and reduces machine load, but the problem is that there is typically no way to delete or update existing items in real time.
This fact bites me on a regular basis. When I make a blog post, Google Reader gets hold of it immediately, but if I realize I wrote something stupid and update the post, the change doesn’t show up for several hours because updates don’t propagate through the real-time mechanism.
Or consider tweets. If you tweet something inappropriate and delete it a second later, it might be too late: Twitter’s partners could have already gotten hold of it through the “firehose,” and it might already be displayed on a sidebar on some other site.
Google’s “undo send” feature is a great solution to this type of problem — it holds the message in a queue for a few seconds before sending it out. Every real-time system needs such a panic feature!
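The hold-then-send idea is simple enough to sketch. The code below is an illustration under my own assumptions, not how Google implements the feature: `DelayedOutbox` and its methods are hypothetical names, and the clock is injectable only so the behavior can be checked without actually waiting.

```python
import time
from typing import Callable, Dict, Tuple


class DelayedOutbox:
    """Hold each message for a grace period before pushing it out."""

    def __init__(self, delay_seconds: float, deliver: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.delay_seconds = delay_seconds
        self.deliver = deliver          # e.g. the function that pushes to subscribers
        self.clock = clock
        self._pending: Dict[int, Tuple[float, str]] = {}
        self._next_id = 0

    def submit(self, message: str) -> int:
        """Queue a message; it is not visible to anyone yet."""
        msg_id = self._next_id
        self._next_id += 1
        self._pending[msg_id] = (self.clock() + self.delay_seconds, message)
        return msg_id

    def undo(self, msg_id: int) -> bool:
        """Cancel a queued message. Returns False if it was already sent."""
        return self._pending.pop(msg_id, None) is not None

    def flush_due(self) -> None:
        """Deliver every message whose grace period has expired."""
        now = self.clock()
        due = [i for i, (release_at, _) in self._pending.items() if release_at <= now]
        for i in due:
            _, message = self._pending.pop(i)
            self.deliver(message)


# Usage with a fake clock so the example is deterministic.
sent = []
now = [0.0]
outbox = DelayedOutbox(5.0, sent.append, clock=lambda: now[0])
regretted = outbox.submit("something inappropriate")
kept = outbox.submit("something sensible")
undone = outbox.undo(regretted)   # the panic button, pressed in time
now[0] = 5.0
outbox.flush_due()                # only the sensible message goes out
```

The cost is a few seconds of latency on every message, which is the trade-off any real-time system adopting this would have to accept.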
While making data searchable greatly increases its utility, it also dramatically increases the privacy risks. It is tempting to tell users to get used to the fact that everything they write is searchable, but that hasn’t been successful so far, as IRSeek found out when they tried to launch an IRC search engine. There are entire companies like ReputationDefender that help you clean up the web search results for your name.
The lack of searchability of your site can be a feature. This is obviously not true for the majority of sites, but it is worth keeping in mind. One major reason why LiveJournal has a “closed” feel, which is a big part of its appeal, is that posts don’t rank well in Google searches, if they are indexed at all. For example, LiveJournal posts have a numeric ID instead of title words in the URL. Although it sounds like someone skipped SEO 101, it is actually by design.
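As a small sketch of what “unsearchable by design” can look like in code, compare an SEO-style URL with an opaque numeric one. The URL shapes and function names here are made up for illustration and are not LiveJournal's actual scheme. The `X-Robots-Tag: noindex` response header is an additional mechanism that the major search engines document; I am not claiming LiveJournal uses it.

```python
import re


def slug_url(post_id: int, title: str) -> str:
    """SEO-style URL: the title words are exposed to search engines."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"/posts/{post_id}-{slug}"


def opaque_url(post_id: int) -> str:
    """Numeric-only URL: nothing about the content leaks into the address."""
    return f"/{post_id}.html"


def response_headers(allow_indexing: bool) -> dict:
    """Build response headers, asking crawlers not to index when requested."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if not allow_indexing:
        headers["X-Robots-Tag"] = "noindex"
    return headers


seo = slug_url(42, "My Secret Diary!")
quiet = opaque_url(42)
```

Neither measure makes a post private; both only make it less likely to surface in a search for the words it contains.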
By aggregate data I mean data from a single source or website, comprising all or a significant fraction of the users. The appeal of aggregate data for research is clear: not only are larger quantities better, but aggregation also avoids the bias problems of sampling. On the other hand, the privacy concerns are also clear: the fear is that the data will end up in the hands of the wrong people, such as one of the database marketing companies.
Aggregation is the most common of the privacy problems among the 7 examples I listed in my previous article. In some cases the original source made the data available and then backtracked, in other cases a third party crawled the data and got into trouble, and some were a mix of both.
For websites sitting on interesting data, an excellent compromise would be in-house data analysis (or perhaps a partnership program with outside researchers), as an alternative to making data public. OkCupid has been doing this extremely well, in my opinion — they have a great series of blog posts on race, looks and everything else that affects online dating. The man-hours spent on data analysis are well worth the increased pageviews and mindshare. Facebook has a data team as well, but given the quantity of data they have, they could be publishing quite a bit more.
By linkage I refer to connecting the same person across multiple websites. Confusingly, this is sometimes referred to as aggregation. Linkage can take the form of database marketers connecting different databases of personal information, or in the online context, it can take the form of tools that link together individual profiles on different websites.
Pervasive online identities are becoming the norm, which is something I’ve been writing about. All of your online activities are going to be easily linkable sooner or later unless you explicitly take steps to keep your identities separate. But again, users haven’t quite woken up to this yet. Unwanted linkage is therefore something that can upset users greatly. The auto-connect feature in Google Buzz is the best example. Opt-in rather than opt-out is probably the way to go, at least for a few years until everyone gets used to it.
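In code, opt-in mostly comes down to what the default is. Here is a minimal sketch, with hypothetical class and field names, in which no profile is linked until the user explicitly says so:

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class LinkagePreferences:
    # Empty by default: nothing is linked until the user opts in.
    linked_services: Set[str] = field(default_factory=set)

    def opt_in(self, service: str) -> None:
        self.linked_services.add(service)

    def opt_out(self, service: str) -> None:
        self.linked_services.discard(service)

    def may_link(self, service: str) -> bool:
        return service in self.linked_services


def visible_profiles(prefs: LinkagePreferences,
                     profiles: Dict[str, str]) -> Dict[str, str]:
    """Return only the external profiles the user has agreed to connect."""
    return {name: url for name, url in profiles.items() if prefs.may_link(name)}


prefs = LinkagePreferences()
known = {"photos": "https://example.com/p/alice", "blog": "https://example.org/alice"}
before = visible_profiles(prefs, known)   # nothing shown for a new user
prefs.opt_in("blog")
after = visible_profiles(prefs, known)    # only the blog is shown
```

An opt-out design would differ only in starting with every detected service already in the set, which is exactly the kind of default that surprises users.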
Summary. While well-understood access control principles tell us how to implement the privacy of data marked private, the privacy of “public” data is just as big a concern. So far there has been no systematic way of analyzing exactly what it is that users object to. In this article I’ve presented five such features. To avoid nasty surprises, developers building websites need to think carefully about privacy and user behavior when implementing any of these features.
Thanks to Ann Kilzer for reviewing a draft.