What Every Developer Needs to Know About “Public” Data and Privacy

July 6, 2010 at 7:43 pm 5 comments

It is natural for developers building web applications to operate under a public/private dichotomy, the assumption being that if a user made a piece of data public, then they’ve given up any privacy expectation. But as we saw in a previous article, users often expect more subtle distinctions, and many unfortunate privacy blunders have resulted. To avoid repeats of these, engineers need to be able to reason about the privacy implications of specific technical features. This article presents a set of criteria for doing so.

1. Archiving

Computers are designed to keep data around forever unless explicitly deleted. But this assumption makes many nontechnical people deeply uncomfortable. There have been a number of proposals to “make the Internet forget,” bringing it in line with humans’ anthropomorphic expectations. While nothing much will probably result from these broad proposals, there need to be some controls on archiving, especially by third parties. Here are three examples that illustrate why this is important:

  • A woman was fired from her job recently because of her employer found some of her online revelations objectionable. She got caught because Topsy, a Twitter search engine, retained her personal data in its cache even after she had deleted it from Twitter.
  • Joe Bonneau revealed that the vulnerability of photo-sharing sites failing to delete photos from their CDN caches persists on many sites, a full year after it was first made public and received media attention.
  • Facebook acted in a heavy-handed manner in its recent spat with Pete Warden. The company’s rationale for prohibiting crawlers seems to be that they want to impose fine-grained restrictions on third party data use. Nontrivial policies can be specified via the Terms of Use, but not via robots.txt.

The examples above show a clear need for a standard for machine-readable third-party data retention policy — a robots.txt on steroids, if you will. Pete Warden proposed expanding robots.txt a few months ago; now that multiple sites are facing this problem, perhaps there will be some momentum in this direction.

2. Real-time

The real-time web relies on “pushing” updates to clients instead of the traditional model of crawling. The push model greatly improves timeliness and machine load, but the problem is that there is typically no way to delete or update existing items in real-time.

This fact bites me on a regular basis. When I make a blog post, Google reader gets hold of it immediately, but if I realize I wrote something stupid and update the post, it doesn’t show up for several hours because updates don’t propagate through the real-time mechanism.

Or consider tweets: if you tweet something inappropriate and delete it a second later, it might be too late: Twitter’s partners could have already gotten hold of it through the “firehose,” and it might already be displayed on a sidebar on some other site.

Google’s “undo send” feature is a great solution to this type of problem — it holds the message in a queue for a few seconds before sending it out. Every real-time system needs such a panic feature!

3. Search

While making data searchable greatly increases its utility, it also dramatically increases the privacy risks. It is tempting to tell users to get used to the fact that everything they write is searchable, but that hasn’t been successful so far, as IRSeek found out when they tried to launch an IRC search engine. There are entire companies like ReputationDefender that help you clean up the web search results for your name.

The lack of searchability of your site can be a feature. This is obviously not true for the majority of sites, but it is worth keeping in mind. One major reason why LiveJournal has a “closed” feel — which is a big part of its appeal — is that posts don’t rank well in Google searches, if they are indexed at all. For example, Livejournal posts have a numeric ID instead of title words in the URL. Although it sounds like someone skipped SEO 101, it is actually by design.

4. Aggregation

By aggregate data I mean data from a single source or website, comprising all or a significant fraction of the users. The appeal of aggregate data for research is clear: not only are larger quantities better, aggregation avoids the bias problems of sampling. On the other hand, the privacy concerns are also clear: the fear is that the data will end up at the hands of the wrong people, such as one of the database marketing companies.

Aggregation is the most common of the privacy problems among the 7 examples I listed in my previous article. In some cases the original source made the data available and then backtracked, in other cases a third party crawled the data and got into trouble, and some were a mix of both.

For websites sitting on interesting data, an excellent compromise would be in-house data analysis (or perhaps a partnership program with outside researchers), as an alternative to making data public. OkCupid has been doing this extremely well, in my opinion — they have a great series of blog posts on race, looks and everything else that affects online dating. The man-hours spent on data analysis are well worth the increased pageviews and mindshare. Facebook has a data team as well, but given the quantity of data they have, they could be publishing quite a bit more.

5. Linkage

By linkage I refer to connecting the same person across multiple websites. Confusingly, this is sometimes referred to as aggregation. Linkage can take the form of database marketers connecting different databases of personal information, or in the online context, it can take the form of tools that link together individual profiles on different websites.

Pervasive online identities are becoming the norm, which is something I’ve been writing about. All of your online activities are going to be easily linkable sooner or later unless you explicitly take steps to keep your identities separate. But again, users haven’t quite woken up to this yet. Unwanted linkage is therefore something that can upset users greatly. The auto-connect feature in Google Buzz is the best example. Opt-in rather than opt-out is probably the way to go, at least for a few years until everyone gets used to it.

Summary. While well-understood access control principles tell us how to implement the privacy of data marked private, the privacy of “public” data is just as big a concern. So far there has been no systematic way of analyzing exactly what it is that users object to. In this article I’ve presented five such features. To avoid nasty surprises, developers building websites need to think carefully about privacy and user behavior when implementing any of these features.

Thanks to Ann Kilzer for reviewing a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

Entry filed under: Uncategorized. Tags: , , , , , , .

Myths and Fallacies of “Personally Identifiable Information” Thoughts on the White House/DHS Identity Plan

5 Comments Add your own

  • 1. Marc Hedlund  |  July 6, 2010 at 11:31 pm

    The punch line is, I can’t log out of WordPress.com from this page – I always see it with the WordPress toolbar at the top, and when I hit “Logout” that does not remove the toolbar on reload. Very aggressive linkage. :)

    Otherwise, good post…

    Reply
  • 2. Stephen Wilson  |  July 16, 2010 at 2:05 am

    Well done, I applaud every effort to alert technologists to the subtleties of privacy law. Significant parts of privacy law are counter intuitive to technologists.

    So I think the basic advice can be even simpler. In particular, technologists should try to avoid the terms “private” and “public”. They don’t figure much in Information Privacy law. Instead the operable concept is “personally identifiable information”. If information is personally identifiable, then the law constrains what an organisation can do with it, regardless of where the information came from. Hence personal information gathered from the public domain (like wifi data hoovered up by Google Streetview cars) is not a freely available resource!

    Reply
  • […] the next article of this series, I will give a rigorous technical characterization of what constitutes publicizing […]

    Reply
  • 4. Annie  |  August 9, 2010 at 7:51 am

    re:Searchability, here is a paper that discusses a possible tradeoff between security and searchability:

    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.150.899&rep=rep1&type=pdf

    Reply
    • 5. Arvind  |  August 9, 2010 at 8:44 am

      Hmm, citeseer seems to be down.. could you tell me the title so I can Google it? Thanks

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed


About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.

Subscribe

Be notified when there's a new post — subscribe to the feed, follow me on Google+ or twitter or use the email subscription box below.

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 245 other followers