Is Making Public Data “More Public” a Privacy Violation?
What on earth does more public mean? Technologists draw a simple distinction between data that is public and data that is not. Under this view, the notion of making data more public is meaningless. But common sense tells us otherwise: it’s hard to explain the opposition to public surveillance if you assume that it’s OK to collect, store and use “public” information indiscriminately.
There are entire philosophical theories devoted to understanding what one can and cannot do with public data in different contexts. Recently, danah boyd argued in her SXSW keynote in support of “privacy through obscurity” and described how technology is destroying this comfort. According to boyd, most public data is “quasi-public” and technologists don’t have the right to “publicize” it.
Some examples. One can debate the point in the abstract, but there is no question that companies and individuals have repeatedly been bitten when applying the “it’s already public” rule. Let’s look at some examples (the list and the discussion are largely concerned with data on the web).
- The availability of the California Birth Index on the web caused considerable consternation about a decade ago, despite the fact that birth records in the state are public and anyone’s birth record can be obtained through official channels albeit in a cumbersome manner.
- IRSeek planned to launch a search engine for IRC in 2007 by monitoring and indexing public channels (chatrooms). There was a predictable privacy outcry and they were forced to shut down.
- The Infochimps guys crawled the Twitter graph back in 2008 and posted it on their site. Twitter forced them to take the dataset down.
- The story was repeated with Pete Warden and Facebook; this time it was nastier and involved the threat of a lawsuit.
- MySpace recently started selling user data in bulk on Infochimps. As MySpace has pointed out, the data is already public, but privacy concerns have nevertheless been raised.
- One reason for the backlash against Google Buzz was auto-connect: it connected your activity on Google Reader and other services and streamed it to your friends. Your Google Reader activities were already public, but Buzz took things further by broadcasting them.
- Spokeo is facing similar criticism. As Snopes explains, “Spokeo displays listings that sometimes contain more personal information than many people are comfortable having made publicly accessible through a single, easy-to-use search site.”
The last four examples are all from the past couple of months. For some reason the issue has suddenly started cropping up all the time. The current situation is bad for everyone: data trustees and data analysts have no clear guidelines in place, and users/consumers are in a position of constantly having to fight back against a loss of privacy. We need to figure out some ground rules to decide what uses of public data on the web are acceptable.
Why not “none”? I don’t agree with a blanket argument against using data for purposes other than originally intended, for many reasons. The first is that users’ privacy expectations, when they go beyond the public/private dichotomy, are generally poorly articulated, frequently unreasonable and occasionally self-contradictory. (An unfortunate but inevitable consequence of the complexity of technology.) The second reason is that these complex privacy rules, even if they can be figured out, often need to be communicated to the machine.
The third reason is the “greater good.” I’ve opposed that line of reasoning when used to justify reneging on an explicit privacy promise. But when it comes to a promise that was never actually made but merely intuitively understood (or misunderstood) by users, I think the question is different, and my stance is softer. Privacy needs to be weighed against the benefit to society from “publicizing” data: disseminating, aggregating and analyzing it.
In the next article of this series, I will give a rigorous technical characterization of what constitutes publicizing data. My hope is that this will go a long way towards determining what is and is not a violation of privacy. In the meantime, I look forward to hearing different opinions.
Thanks to Pete Warden and Vimal Jeyakumar for comments on a draft.