Posts tagged ‘conference’
Last week I participated in the Web Privacy Measurement conference at Berkeley. It was a unique event because the community is quite new and this was our very first gathering. The WSJ Data Transparency hackathon is closely related; the Berkeley conference can be thought of as an academic counterpart. So it was doubly fascinating for me — both for the content and because of my interest in the sociology of research communities.
A year ago I explained that there is an information asymmetry when it comes to online privacy, leading to a “market for lemons.” The asymmetry exists for two main reasons: one is that companies don’t disclose what data they collect about you and what they do with it; the second is that even if they do, end users don’t have the capacity to aggregate and process that information and make decisions on the basis of it.
The Web Privacy Measurement community essentially exists to mitigate this asymmetry. The primary goal is to ferret out what is happening to your data online, and a secondary one is making this information useful by pushing for change, building tools for opt-out and control, comparison of different players, etc. The size of the community is an indication of how big the problem has gotten.
Before anyone starts trotting out the old line, “see, the market can solve everything!”, let me point out that the event schedule demonstrates, if anything, the opposite. The majority of what is produced here is intended wholly or partly for the consumption of regulators. Like many others, I found the “What privacy measurement is useful for policymakers?” panel to be the most interesting one. And let’s not forget that most of this is Government-funded research to begin with.
This community is very different from the others that I’ve belonged to. The mix of backgrounds is extraordinary: researchers mainly from computing and law, and a small number from other disciplines. Most of the researchers are academics, but a few work for industrial research labs, a couple are independent, and one or two work in Government. There were also people from companies that make privacy-focused products/services, lawyers, hobbyists, scholars in the humanities, and ad-industry representatives. Overall, the community has a moderately adversarial relationship with industry, naturally, and a positive relationship with the press, regulators and privacy advocates.
The make-up is somewhat similar to the (looser-knit) group of researchers and developers building decentralized architectures for personal data, a direction that my coauthors and I have taken a skeptical view of in this recent paper. In both cases, the raison d’être of the community is to correct the imbalance of power between corporations and the public. There is even some overlap between the two groups of people.
The big difference is that the decentralization community, typified by Diaspora, mostly tries to mount a direct challenge and overthrow the existing order, whereas our community is content to poke, measure, and expose, and hand over our findings to regulators and other interested parties. So our potential upside is lower — we’re not trying to put a stop to online tracking, for example — but the chance that we’ll succeed in our goals is much higher.
Thanks to Aleecia McDonald for reviewing a draft.
Privacy norms, rules and expectations in the real world go far beyond the “public/private” dichotomy. Yet in the realm of web crawler access control, we are tied to this binary model via the robots.txt allow/deny rules. This position paper describes some of the resulting problems and argues that it is time for a more sophisticated standard.
The problem: privacy of public data. The first author has argued that individuals often expect privacy constraints on data that is publicly accessible on the web. Some examples of such constraints relevant to the web-crawler context are:
- Data should not be archived beyond a certain period (or at all).
- Crawling a small number of pages is allowed, but large-scale aggregation is not.
- “Linkage” of personal information to other databases is prohibited.
Currently there is no way to specify such restrictions in a machine-readable form. As a result, sites resort to hacks such as identifying and blocking crawlers whose behavior they don’t like, without clearly defining acceptable behavior. Other sites specify restrictions in the Terms of Service and bring legal action against violators. This is clearly not a viable solution: for operators of web-scale crawlers, manually interpreting and encoding the ToS restrictions of every site is prohibitively expensive.
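To make the idea concrete, here is a minimal Python sketch of what a machine-readable encoding of the three example constraints might look like. Everything in it (the `CrawlPolicy` class, its field names, and the `violations` helper) is hypothetical and invented for illustration; no such standard exists.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CrawlPolicy:
    """Hypothetical per-site policy; the field names are illustrative only."""
    max_archive_days: Optional[int] = None  # None: no retention limit; 0: do not archive
    max_pages: Optional[int] = None         # None: no cap on the number of pages crawled
    allow_linkage: bool = True              # False: joining with other databases is prohibited


def violations(policy: CrawlPolicy, pages_crawled: int,
               archive_days: int, links_to_other_data: bool) -> List[str]:
    """Return the constraints that a crawler's planned behavior would break."""
    problems = []
    if policy.max_archive_days is not None and archive_days > policy.max_archive_days:
        problems.append("archive period too long")
    if policy.max_pages is not None and pages_crawled > policy.max_pages:
        problems.append("too many pages crawled")
    if links_to_other_data and not policy.allow_linkage:
        problems.append("linkage prohibited")
    return problems


policy = CrawlPolicy(max_archive_days=30, max_pages=100, allow_linkage=False)
small_crawl = violations(policy, pages_crawled=50, archive_days=7, links_to_other_data=False)
bulk_crawl = violations(policy, pages_crawled=5000, archive_days=365, links_to_other_data=True)
print(small_crawl)
print(bulk_crawl)
```

With a declaration like this, a crawler could check its own plan against a site's wishes automatically rather than relying on someone to read the Terms of Service.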
There are two reasons why the problem has become pressing: first, there is an ever-increasing quantity of behavioral data about users that is valuable to marketers — in fact, there is even a black market for this data — and second, crawlers have become very cheap to set up and operate.
The desire for control over web content is by no means limited to user privacy concerns. Publishers concerned about copyright are equally in search of a better mechanism for specifying fine-grained restrictions on the collection, storage and dissemination of web content. Many site owners would also like to limit the acceptable uses of data for competitive reasons.
The solution space. Broadly, there are three levels at which access/usage rules may be specified: site-level, page-level and DOM element-level. Robots.txt is an example of a site-level mechanism, and one possible solution is to extend robots.txt. A disadvantage of this approach, however, is that the file may grow too large, especially on sites with user-generated content that may wish to specify per-user policies.
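As a small illustration of the binary model that robots.txt offers today, the following sketch uses Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent and the example.com URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up site-level policy: everything is allowed except /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The only question a crawler can ask is "may I fetch this URL?"
public_ok = parser.can_fetch("ExampleBot", "https://example.com/profiles/alice")
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/notes")
print(public_ok, private_ok)
```

The answer is a plain yes or no per URL. Nothing in this interface can express a retention period, a cap on aggregation, or a ban on linkage.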
A page-level mechanism thus sounds much more suitable. While there is already a “robots” attribute to the META tag, it is part of the robots.txt specification and has the same limitations on functionality. A different META tag is probably an ideal place for a new standard.
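A page-level mechanism could be read by a crawler along the following lines. This is only a sketch: the `crawl-policy` META name and its `max-archive-days` and `no-linkage` directives are hypothetical and not part of any existing standard. The parsing uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class PolicyMetaParser(HTMLParser):
    """Collects directives from a hypothetical <meta name="crawl-policy"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "crawl-policy":
            return
        # content is a comma-separated list of "key=value" pairs or bare flags
        for item in (attr.get("content") or "").split(","):
            key, _, value = item.strip().partition("=")
            if key:
                self.directives[key.lower()] = value or "true"


page = ('<html><head>'
        '<meta name="crawl-policy" content="max-archive-days=30, no-linkage">'
        '</head><body>Profile page</body></html>')

meta_parser = PolicyMetaParser()
meta_parser.feed(page)
print(meta_parser.directives)
```

Because the policy travels with each page, a site with user-generated content could emit a different tag for each user without maintaining one large site-wide file.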
Taking it one step further, tagging at the DOM element-level using microformats to delineate personal information has also been proposed. A possible disadvantage of this approach is the overhead of parsing pages that crawlers will have to incur in order to be compliant.
Conclusion. While the need to move beyond the current robots.txt model is apparent, it is not yet clear what should replace it. The challenge in developing a new standard lies in accommodating the diverse requirements of website operators and precisely defining the semantics of each type of constraint without making it too cumbersome to write a compliant crawler. In parallel with this effort, the development of legal doctrine under which the standard is more easily enforceable is likely to prove invaluable.
I attended a couple of conferences this week that are outside my usual community. Taking stock of and interacting with a new crowd is always a very interesting experience.
The first was the IAPP Practical Privacy Series. The International Association of Privacy Professionals came about as a result of the fact that the Chief Privacy Officer (and equivalent) positions have suddenly emerged — over the last decade — and become ubiquitous. The role can be broadly described as “privacy compliance.” A big part of the initial impetus seems to have been HIPAA compliance, but the IAPP composition has now diversified greatly, because virtually every company is sitting on a pile of consumer data. There was even someone from Starbucks.
I spoke about anonymization. I was trying to answer the question, “I need to share/sell my data and you’re telling me that anonymization is broken. So what should I do?” It’s always a fun challenge to make computer science accessible to a non-tech audience (largely lawyers in this case). I think I managed reasonably well.
Next was the ACM Computers, Freedom and Privacy conference (which goes on until Friday). As I understand it, CFP was born at a time when “Cyberspace” was analogous to the Wild West, and there was a big need for self-governance and figuring out the emerging norms. The landscape is of course very different now, since the Internet is no longer a band of outlaws but is integrated into normal society. The conference has accordingly morphed somewhat, although a lot of the old crowd still definitely comes here.
The quality of the events I attended was highly variable. I checked out the “unconferences,” but only a couple had a meaningful level of participation and the one I went to seemed to devolve pretty quickly into a penis-waving contest. The session I liked best was a tutorial by Mike Godwin (of Godwin’s law, now counsel for the Wikimedia Foundation) on Cyberlaw, mainly First Amendment law.
CFP has parallel sessions. I had a great experience with that format at the Privacy Law Scholars Conference, but this time I’m not so sure — I’m regularly finding conflicts among the sessions I want to attend.
I’m bummed about the fact that there is really no mechanism for me to learn about conferences that are relevant to my interests but are outside my community. (I only learned about the IAPP workshop because I was invited to speak, and CFP purely coincidentally.) Do other researchers face this problem as well? I’m curious to hear about how people keep abreast. I mean, it’s 2010, and this is exactly the kind of problem that social media is supposed to be great at solving, but it’s not really working for me.
I had a great time at the Privacy Law Scholars Conference in Berkeley last week, perhaps more so than at any CS conference I’ve attended. A major reason was that there were (get this) no talks. Well, just one keynote speech. The format centered around 75-minute discussion sessions (which seem to be called workshops), with 5 parallel tracks; in each session, you pick which track you want to attend. You are supposed to have read the paper beforehand, and usually everyone in the room has something to say and gets a chance to do so.
This seems way more sensible to me than the format of CS conferences, where there is only one track. I can’t imagine that anyone would genuinely want to attend all the talks. Ideally, for any given talk, half the people should skip it and spend their time networking instead, but in my experience this never happens. Worse, the talks are only 20-30 minutes long; while this is enough time to motivate the paper and inspire the listeners to go read it afterward, it is never enough to explain the whole paper. Sometimes speakers don’t get this concept, and the results are not pretty.
Anyways, I was surprised by the ease with which I could read law papers and participate in the discussions, even if my understanding was (obviously) not nearly as deep as that of a law scholar. This is something to ponder — while legalese is dense and frequently obfuscated, law papers are a breeze to read, at least based on my small sample size.
There is one paper, by Paul Ohm, that I particularly enjoyed: it is about re-examining privacy laws and regulatory strategies in the light of re-identification techniques. This generated a lot of interest at the conference, and I found the discussion fascinating. A major reason I started 33bits was to be able to play a part in informing these developments; it seems that this blog has indeed helped, which is highly gratifying. I learnt a lot about privacy and anonymity in general, and I look forward to writing more about it in future posts, to the extent that I can do so without talking about specific workshop discussions, which are confidential.