Women in Tech: How Anonymity Contributes to the Problem

Like Michael Arrington, I too have sat on the sidelines of the debate on women in tech. Unlike Michael Arrington, I did so because nobody asked for my opinion. There is, however, one aspect of the debate that I’m qualified to comment on.

The central issue seems to be whether the low participation rate of women in technology is due to a hostile environment in the tech industry (e.g., sexism, overt or covert) or due to external factors, whether genetic or social, that influence women to pick career paths other than technology without even giving it a shot.

Arrington thinks it’s the latter, and makes a strong case for his position. In response, many have pointed out various behaviors common in the tech industry that make it unappealing to women. Jessica B. Hamrick talks about rampant elitism which affects women disproportionately. What I’m more interested in today is Michelle Greer’s account of being viciously attacked for a relatively innocuous comment on Arrington’s post.

Let me come right out and say it: while I am a defender of the right to anonymous speech, I believe it has no place whatsoever in the vast majority of discussion forums. The reason is simple: there is something about anonymity that completely dismantles our evolved social norms and civility and makes us behave like apes. Not all of us, to be sure, but it only takes a few to ruin it for everyone. Or to put it in plainer terms:

There is no doubt that sexist comments online — the vast majority of them anonymous — contribute hugely to the problem of tech being a hostile environment for women. While there are rude comments directed at everyone, just look around if you need convincing that the ones that attack someone specifically for being female tend to be much more depraved. It is also true that rude behavior online is not limited to tech fields, but it creates more of a barrier there because online participation is essential for being relevant.

Here’s my suggestion to everyone who’d like to do something to make tech less hostile to women: perhaps the best return on your time that you can get is by making anonymous, unmoderated comments a thing of the past. Abolish it on your own sites, and write to other site admins and educate them about the importance of this issue. And when you see an uncivil comment, either educate or ignore the person, but try not to get enraged — you’d be feeding the troll.

Thanks to Ann Kilzer for reviewing a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

4 comments August 30, 2010

Thoughts on the White House/DHS Identity Plan

The White House and the Department of Homeland Security have come out with an initiative called the National Strategy for Trusted Identities in Cyberspace (NSTIC). Before you ask, no, government people don’t ever plan to stop using the word “cyberspace.”☺

The NSTIC is a vision for identities online, with the Government exerting a significant level of authority over the system and the implementation being carried out largely by the private sector. I earlier submitted technical comments on the draft coauthored with Stanford colleague Jonathan Mayer. This post reflects my personal view of the NSTIC.

Depth. The first thing one notices about the NSTIC strategy document is that it is quite high-level. A lot of the details are going to hinge on a separate implementation plan document. There is an early draft of the implementation document being kicked around, but it is not yet available to the public. The public comment period for the strategy document has closed.

Scope. Identity is a highly overloaded term. The NSTIC doesn’t offer a clear definition; in our comment we analyzed the different flavors and aspects of identity that are being addressed:  (i) plain old self-asserted online identity (ii) linking real-world ID to online ID (iii) public-key infrastructure (iv) attribute authentication (v) anonymous credentials, (vi) credential management and (vii) identity interoperability. The NSTIC tries to do a lot.

Why identity? What problems does it solve? Many others have commented on the fact that the hard problems of cybersecurity, such as malware, cannot really be solved by identity; moreover, solving them might be a requirement for getting an identity infrastructure to work, rather than an outcome. In our technical document we offer a detailed breakdown of what security threats exist today and how the identity plan addresses (or fails to address) each of them.

The NSTIC claims to have been developed in response to a variety of cybersecurity threats. It appears to me, however, that the main goal here is to develop an identity system, and the cybersecurity motivations were tacked on as afterthoughts. While I don’t see a problem with wanting an identity infrastructure for its own sake, it is important not to view it as some kind of panacea.

Process. The only public comment process was via the ideascale website and lasted about 3 weeks. Creating a Web 2.0 crowdsourcing site with a cute theme and opening up the gates is a solution to some problems, but not all. Sometimes actual expertise is required. I find it hilarious that one of the few comments with a deep grasp of the technical issues — by the ACM — was voted down to a -2.

It’s unfortunate that there was no effort to get thet input of CS security researchers in the drafting of the NSTIC. There are many aspects of the plan whose implementation involves research questions, and it is not clear how they are going to be solved. I sure hope there will be more outreach to the security community as the process moves forward.

Summary. I definitely see a role for government leadership in the identity space, and I am glad that this is being worked on. At the same time, there are a variety of concerns with the proposal, including scope creep (look what happened to SSN), malware and software security, hardware security and vested interests (especially since they’re talking about smart cards, DRM, etc.), and usability. It is too early to tell how this is going to turn out. Let’s keep our fingers crossed.

Thanks to Jonathan Mayer for comments on a draft of this blog post.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

Add comment August 18, 2010

What Every Developer Needs to Know About “Public” Data and Privacy

It is natural for developers building web applications to operate under a public/private dichotomy, the assumption being that if a user made a piece of data public, then they’ve given up any privacy expectation. But as we saw in a previous article, users often expect more subtle distinctions, and many unfortunate privacy blunders have resulted. To avoid repeats of these, engineers need to be able to reason about the privacy implications of specific technical features. This article presents a set of criteria for doing so.

1. Archiving

Computers are designed to keep data around forever unless explicitly deleted. But this assumption makes many nontechnical people deeply uncomfortable. There have been a number of proposals to “make the Internet forget,” bringing it in line with humans’ anthropomorphic expectations. While nothing much will probably result from these broad proposals, there need to be some controls on archiving, especially by third parties. Here are three examples that illustrate why this is important:

  • A woman was fired from her job recently because of her employer found some of her online revelations objectionable. She got caught because Topsy, a Twitter search engine, retained her personal data in its cache even after she had deleted it from Twitter.
  • Joe Bonneau revealed that the vulnerability of photo-sharing sites failing to delete photos from their CDN caches persists on many sites, a full year after it was first made public and received media attention.
  • Facebook acted in a heavy-handed manner in its recent spat with Pete Warden. The company’s rationale for prohibiting crawlers seems to be that they want to impose fine-grained restrictions on third party data use. Nontrivial policies can be specified via the Terms of Use, but not via robots.txt.

The examples above show a clear need for a standard for machine-readable third-party data retention policy — a robots.txt on steroids, if you will. Pete Warden proposed expanding robots.txt a few months ago; now that multiple sites are facing this problem, perhaps there will be some momentum in this direction.

2. Real-time

The real-time web relies on “pushing” updates to clients instead of the traditional model of crawling. The push model greatly improves timeliness and machine load, but the problem is that there is typically no way to delete or update existing items in real-time.

This fact bites me on a regular basis. When I make a blog post, Google reader gets hold of it immediately, but if I realize I wrote something stupid and update the post, it doesn’t show up for several hours because updates don’t propagate through the real-time mechanism.

Or consider tweets: if you tweet something inappropriate and delete it a second later, it might be too late: Twitter’s partners could have already gotten hold of it through the “firehose,” and it might already be displayed on a sidebar on some other site.

Google’s “undo send” feature is a great solution to this type of problem — it holds the message in a queue for a few seconds before sending it out. Every real-time system needs such a panic feature!

3. Search

While making data searchable greatly increases its utility, it also dramatically increases the privacy risks. It is tempting to tell users to get used to the fact that everything they write is searchable, but that hasn’t been successful so far, as IRSeek found out when they tried to launch an IRC search engine. There are entire companies like ReputationDefender that help you clean up the web search results for your name.

The lack of searchability of your site can be a feature. This is obviously not true for the majority of sites, but it is worth keeping in mind. One major reason why LiveJournal has a “closed” feel — which is a big part of its appeal — is that posts don’t rank well in Google searches, if they are indexed at all. For example, Livejournal posts have a numeric ID instead of title words in the URL. Although it sounds like someone skipped SEO 101, it is actually by design.

4. Aggregation

By aggregate data I mean data from a single source or website, comprising all or a significant fraction of the users. The appeal of aggregate data for research is clear: not only are larger quantities better, aggregation avoids the bias problems of sampling. On the other hand, the privacy concerns are also clear: the fear is that the data will end up at the hands of the wrong people, such as one of the database marketing companies.

Aggregation is the most common of the privacy problems among the 7 examples I listed in my previous article. In some cases the original source made the data available and then backtracked, in other cases a third party crawled the data and got into trouble, and some were a mix of both.

For websites sitting on interesting data, an excellent compromise would be in-house data analysis (or perhaps a partnership program with outside researchers), as an alternative to making data public. OkCupid has been doing this extremely well, in my opinion — they have a great series of blog posts on race, looks and everything else that affects online dating. The man-hours spent on data analysis are well worth the increased pageviews and mindshare. Facebook has a data team as well, but given the quantity of data they have, they could be publishing quite a bit more.

5. Linkage

By linkage I refer to connecting the same person across multiple websites. Confusingly, this is sometimes referred to as aggregation. Linkage can take the form of database marketers connecting different databases of personal information, or in the online context, it can take the form of tools that link together individual profiles on different websites.

Pervasive online identities are becoming the norm, which is something I’ve been writing about. All of your online activities are going to be easily linkable sooner or later unless you explicitly take steps to keep your identities separate. But again, users haven’t quite woken up to this yet. Unwanted linkage is therefore something that can upset users greatly. The auto-connect feature in Google Buzz is the best example. Opt-in rather than opt-out is probably the way to go, at least for a few years until everyone gets used to it.

Summary. While well-understood access control principles tell us how to implement the privacy of data marked private, the privacy of “public” data is just as big a concern. So far there has been no systematic way of analyzing exactly what it is that users object to. In this article I’ve presented five such features. To avoid nasty surprises, developers building websites need to think carefully about privacy and user behavior when implementing any of these features.

Thanks to Ann Kilzer for reviewing a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

5 comments July 6, 2010

Myths and Fallacies of “Personally Identifiable Information”

I have a new paper (PDF) with Vitaly Shmatikov in the June issue of the Communications of the ACM. We talk about the technical and legal meanings of “personally identifiable information” (PII) and argue that the term means next to nothing and must be greatly de-emphasized, if not abandoned, in order to have a meaningful discourse on data privacy. Here are the main points:

The notion of PII is found in two very different types of laws: data breach notification laws and information privacy laws. In the former, the spirit of the term is to encompass information that could be used for identity theft. We have absolutely no issue with the sense in which PII is used in this category of laws.

On the other hand, in laws and regulations aimed at protecting consumer privacy, the intent is to compel data trustees who want to share or sell data to scrub “PII” in a way that prevents the possibility of re-identification. As readers of this blog know, this is essentially impossible to do in a foolproof way without losing the utility of the data. Our paper elaborates on this and explains why “PII” has no technical meaning, given that virtually any non-trivial information can potentially be used for re-identification.

What we are gunning after is the get-out-of-jail-free card, a.k.a. “safe harbor,” particularly in the HIPAA (health information privacy) context. In current practice, data owners can absolve themselves of responsibility by performing a syntactic “de-identification” of the data (although this isn’t the spirit of the law). Even your genome is not considered identifying!

Meaningful privacy protection is possible if account is taken of the specific types of computations that will be performed on the data (e.g., collaborative filtering, fraud detection, etc.). It is virtually impossible to guarantee privacy by considering the data alone, without carefully defining and analyzing its desired uses.

We are well aware of the burden that this imposes on data trustees, many of whom find even the current compliance requirements onerous. Often there is no one available who understands computer science or programming, and there is no budget to hire someone who does. That is certainly a conundrum, and it isn’t going to be fixed overnight. However, the current situation is a farce and needs to change.

Given that technologically sophisticated privacy protection mechanisms require a fair bit of expertise (although we hope that they will become commoditized in a few years), one possible way forward is by introducing stronger acceptable-use agreements. Such agreements would dictate what the collector or recipient of the data can and cannot do with it. They should be combined with some form of informed consent, where users (or, in the health care context, patients) acknowledge their understanding that there is a re-identification risk. But the law needs to change to pave the way for this more enlightened approach.

Thanks to Vitaly Shmatikov for comments on a draft of this post.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

6 comments June 21, 2010

Conferences: The Good, the Bad and the Ugly aspects

I attended a couple of conferences this week that are outside my usual community. Taking stock of and interacting with a new crowd is always a very interesting experience.

The first was the IAPP Practical Privacy Series. The International Association of Privacy Professionals came about as a result of the fact that the Chief Privacy Officer (and equivalent) positions have suddenly emerged — over the last decade — and become ubiquitous. The role can be broadly described as “privacy compliance.” A big part of the initial impetus seems to have been HIPAA compliance, but the IAPP composition has now diversified greatly, because virtually every company is sitting on a pile of consumer data. There was even someone from Starbucks.

I spoke about anonymization. I was trying to answer the question, “I need to share/sell my data and you’re telling me that anonymization is broken. So what should I do?”. It’s always a fun challenge to make computer science accessible to a non-tech audience (largely lawyers in this case). I think I managed reasonably well.

Next was the ACM Computers, Freedom and Privacy conference (which goes on until Friday). As I understand it, CFP was born at a time when “Cyberspace” was analogous to the Wild West, and there was a big need for self-governance and figuring out the emerging norms. The landscape is of course very different now, since the Internet isn’t a band of outlaws anymore but integrated into normal society. The conference has accordingly morphed somewhat, although a lot of the old crowd still definitely comes here.

The quality of the events I attended were highly variable. I checked out the “unconferences,” but only a couple had a meaningful level of participation and the one I went to seemed to devolve pretty quickly into a penis-waving contest. The session I liked best was a tutorial by Mike Godwin (of Godwin’s law, now counsel for the Wikimedia foundation) on Cyberlaw, mainly First Amendment law.

CFP has parallel sessions. I had a great experience with that format at the Privacy Law Scholars Conference, but this time I’m not so sure — I’m regularly finding conflicts among the sessions I want to attend.

I’m bummed about the fact that there is really no mechanism for me to learn about conferences that are relevant to my interests but are outside my community. (I only learned about the IAPP workshop because I was invited to speak, and CFP purely coincidentally.) Do other researchers face this problem as well? I’m curious to hear about how people keep abreast. I mean, it’s 2010, and this is exactly the kind of problem that social media is supposed to be great at solving, but it’s not really working for me.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

4 comments June 17, 2010

Yet Another Identity Stealing Bug. Will Creeping Normalcy be the Result?

Elie Bursztein points me to a “Cross Site URL Hijacking” attack which, among other things, allows a website to identify a visitor instantly (if they are using Firefox) by finding their Google and possibly Facebook IDs. Here is a live demo and here’s a paper.

For the security geeks, the attack works by exploiting a Firefox bug that allows a page in the attacker domain to infer URLs of pages in the target domain. If a page like target.com/home redirects to target.com/?user=[username] (which is quite common), the attacker can learn the username by requesting the page target.com/home in a script tag.

Let us put this attack in context. Stealing the identity of a web visitor should be familiar to readers of this blog. I’ve recently written about doing this via history stealing, then a bug in Google spreadsheets, and now we have this. While the spreadsheets bug was fixed, the history stealing vulnerability remains in most browsers. Will new bugs be found faster than existing ones getting fixed? The answer is probably yes.

Something that is of much more concern in the long run is Facebook’s instant personalization, which is basically like identity stealing, except it is a feature rather than a bug. Currently Facebook identities are available without user consent to only 3 partners (Yelp, Pandora and docs.com) but there will be inevitable competitive pressures both for Facebook to open this up to more websites as well as for other identity providers to offer a similar service.

Legitimate methods and hacks based on bugs are not entirely distinct. Two XSS attacks on yelp.com were found in quick succession either of which could have been exploited by a third (fourth?) party for identity stealing. Instant personalization (and similar attempts at an “identity layer”) greatly increase the chance of bugs that leak your identity to every website, authorized or not.

As identity-stealing bugs as well as identity-sharing features proliferate, the result is going to be creeping normalcy — users will get slowly inured to the idea that any website they visit might have their identity. And that will be a profound change for the way the web works. Of course, savvy users will know how to turn off the various tracking mechanisms, but most people will be left in the lurch.

We are still at the early stages of this shift. It is clear that it will have both good and ill effects. For example, people are much more civil when interacting under their real-life identity. For this reason, there is quite a clamor for identity. For instance, see News Sites Rethink Anonymous Online Comments and The Forces Align Against Anonymity. But like every change, this one is going to be hard to get used to.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

1 comment June 1, 2010

Facebook, Privacy, Public Opinion and Pitchforks

As just about everyone is already aware, Facebook has been up to a bunch of big brotherly stuff lately, including “instant personalization” — making your identity and data available to 3rd party sites you visit, arguing to treat ToS violations as criminal violations, and forcing you to make your “interests” public (or delete them). Overall, it looks like they’re making a bold move to take control of everyone’s identity and connections, privacy be damned.

The entirely predictable effect of this has been that everything the company now does is being viewed with extreme suspicion. The pitchforks have been sharpened, and the mob gets set off on almost any excuse. In the last week, one somewhat questionable feature, one minor bug and one utter non-event have each been reported as sinister privacy disasters:
  1. The questionable feature was linking your statuses to “connections” pages. The outrage was based on the meme “if your status contains the word FBI then the FBI will have a record of it,” which appears to have started here. That article is full of hyperbole and understandably appears to have been widely misunderstood to be claiming that even private statuses appear on Connection pages (they don’t). There’s really nothing new in terms of the visibility of your statuses: Facebook already had real-time search for public statuses, and the only difference is that someone can now click on the “FBI” page instead of having to type in “FBI” into the search box.
  2. The minor bug was that Facebook started listing Connect-enabled websites you visit in the “Applications” tab in your privacy settings. The sites didn’t get your identity, any of your data, nor did they have priveleges to post to your wall. The fact that you visited them was not visible to anyone else. No actual harm was done. And yet an article titled Facebook’s new features secretly add apps to your profile alleged all of these things without making any real effort to check with Facebook. Facebook quickly fixed the bug and contacted the authors, and they updated the story, but it did little to quell the rumors which took on a life of their own.
  3. The non-issue was Facebook leaking your IP address in email notifications. This is normal behavior: most webmail providers, except gmail, put the sender’s IP into the message header as a spam-prevention technique. This kicked up another shitstorm.

In spite of these unfair accusations, it is hard for me to feel any sympathy for the beleaguered company. This is how public opinion works, and they can’t claim not to have seen it coming. As this fantastic visualization by Matt McKeon shows, Facebook has been on a long and consistent path to make all of your information public, essentially pulling a giant bait-and-switch on their users. They stepped up the pace recently, asked their users to give up too much too fast, and something just snapped.

I think Facebook underestimated the extent to which privacy correlates with trust. They were forgiven for Beacon and other problems in the past, but after the most recent series of privacy violations, it became clear that these were not missteps but deliberate actions. I believe that Facebook’s relationship with its users has changed fundamentally, and isn’t going to mend any time soon. Perhaps Facebook’s reckoning is that they are now big enough that it doesn’t matter any more. That remains to be seen.

On a personal note, someone pretty high up at Facebook emailed me a couple of months ago (although “not in an official capacity”) to have a discussion about privacy issues with some of their upcoming product launches. Unfortunately I was traveling at the time, and when I got back they were no longer interested. I guess by then it was too close to f8 and all the important decisions had been made. I can’t help wondering if the outcome might have been different if I’d been able to meet with them — perhaps they might have eased off just a little bit on their world-domination plans and avoided the straw that broke the camel’s back. But I suspect that that’s just wishful thinking, given that the imperative for their current push in all likelihood came from the very top.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

5 comments May 10, 2010

Is Making Public Data “More Public” a Privacy Violation?

What on earth does more public mean? Technologists draw a simple distinction between data that is public and data that is not. Under this view, the notion of making data more public is meaningless. But common sense tells us otherwise: it’s hard to explain the opposition to public surveillance if you assume that it’s OK to collect, store and use “public” information indiscriminately.

There are entire philosophical theories devoted to understanding what one can and cannot do with public data in different contexts. Recently, danah boyd argued in her SXSW keynote in support of “privacy through obscurity” and how technology is destroying this comfort. According to boyd, most public data is “quasi-public” and technologists don’t have the right to “publicize” it.

Some examples. One can debate the point in the abstract, but there is no question that companies and individuals have repeatedly been bitten when applying the “it’s already public” rule. Let’s look at some examples (the list and the discussion is largely concerned with data on the web).

  1. The availability of the California Birth Index on the web caused considerable consternation about a decade ago, despite the fact that birth records in the state are public and anyone’s birth record can be obtained through official channels albeit in a cumbersome manner.
  2. IRSeek planned to launch a search engine for IRC in 2007 by monitoring and indexing public channels (chatrooms). There was a predictable privacy outcry and they were forced to shut down.
  3. The Infochimps guys crawled the Twitter graph back in 2008 and posted it on their site. Twitter forced them to take the dataset down.
  4. The story was repeated with Pete Warden and Facebook; this time it was nastier and involved the threat of a lawsuit.
  5. MySpace recently started selling user data in bulk on Infochimps. As MySpace has pointed out, the data is already public, but privacy concerns have nevertheless been raised.
  6. One reason for the backlash against Google Buzz was auto-connect: it connected your activity on Google Reader and other services and streamed it to your friends. Your Google Reader activities were already public, but Buzz took it further by broadcasting it.
  7. Spokeo is facing similar criticism. As Snopes explains, “Spokeo displays listings that sometimes contain more personal information than many people are comfortable having made publicly accessible through a single, easy-to-use search site.”

The latter four examples are all from the last couple of months. For some reason the issue has suddenly started cropping up all the time. The current situation is bad for everyone: data trustees and data analysts have no clear guidelines in place, and users/consumers are in a position of constantly having to fight back against a loss of privacy. We need to figure out some ground rules to decide what uses of public data on the web are acceptable.

Why not “none?” I don’t agree with a blanket argument against using data for purposes other than originally intended, for many reasons. The first is that users’ privacy expectations, when they go beyond the public/private dichotomy, are generally poorly articulated, frequently unreasonable and occasionally self-contradictory. (An unfortunate but inevitable consequence of the complexity of technology.) The second reason is that these complex privacy rules, even if they can be figured out, often need to be communicated to the machine.

The third reason is the “greater good.” I’ve opposed that line of reasoning when used to justify reneging on an explicit privacy promise. But when it comes to a promise that was never actually made but merely intuitively understood (or mis-understood) by users, I think the question is different, and my stance is softer. Privacy needs to be weighed against the benefit to society from “publicizing” data — disseminating, aggregating and analyzing it.

In the next article of this series, I will give a rigorous technical characterization of what constitutes publicizing data. My hope is that this will go a long way towards determining what is and is not a violation of privacy. In the meanwhile, I look forward to hearing different opinions.

Thanks to Pete Warden and Vimal Jeyakumar for comments on a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

13 comments April 5, 2010

An open letter to Netflix from the authors of the de-anonymization paper

Dear Netflix,

Today is a sad day. It is also a day of hope.

It is a sad day because the second Netflix challenge had to be cancelled. We never thought it would come to this. One of us has publicly referred to the dampening of research as the “worst possible outcome” of privacy studies. As researchers, we are true believers in the power of data to benefit mankind.

We published the initial draft of our de-anonymization study just two weeks after the dataset for the first Netflix Prize became public. Since we had the math to back up our claims, we assumed that lessons would be learned, and that if there were to be a second data release, it would either involve only customers who opted in, or a privacy-preserving data analysis mechanism. That was three and a half years ago.

Instead, you brushed off our claims, calling them “absolutely without merit,” among other things. It has taken negative publicity and an FTC investigation to stop things from getting worse. Some may make the argument that even if the privacy of some of your customers is violated, the benefit to mankind outweighs it, but the “greater good” argument is a very dangerous one. And so here we are.

We were pleasantly surprised to read the plain, unobfuscated language in the blog post announcing the cancellation of the second contest. We hope that this signals a change in your outlook with respect to privacy. We are happy to see that you plan to “continue to explore ways to collaborate with the research community.”

Running something like the Netflix Prize competition without compromising privacy is a hard problem, and you need the help of privacy researchers to do it right. Fortunately, there has been a great deal of research on “differential privacy,” some of it specific to recommender systems. But there are practical challenges, and overcoming them will likely require setting up an online system for data analysis rather than an “anonymize and release” approach.

Data privacy researchers will be happy to work with you rather than against you. We believe that this can be a mutually beneficial collaboration. We need someone with actual data and an actual data-mining goal in order to validate our ideas. You will be able to move forward with the next competition, and just as importantly, it will enable you to become a leader in privacy-preserving data analysis. One potential outcome could be an enterprise-ready system which would be useful to any company or organization that outsources analysis of sensitive customer data.

It’s not often that a moral imperative aligns with business incentives. We hope that you will take advantage of this opportunity.

Arvind Narayanan and Vitaly Shmatikov


For background, see our paper and FAQ.

To stay on top of future posts on 33bits.org, subscribe to the RSS feed or follow me on Twitter.

19 comments March 15, 2010

History Stealing: It’s All Shades of Grey

Previous articles in this series showed that ‘Ubercookies’ can enable websites to learn the identity of any visitor by exploiting the ‘history stealing’ bug in web browsers, and presented different types of de-anonymization attacks. This article is all about the question, “but who is the adversary?”

Good and evil. It is tempting for security researchers to think of the world in terms of good guys and bad guys — white hats and black hats. It is a view of the world that is probably hardwired into our brains, reflected everywhere from religious beliefs to Hollywood plots. But reality is more complex. Heroes are flawed, and the bad guys are not really evil. But enough with the moral lecture, let’s see how this pertains to history stealing and identity stealing.

Black hat. I don’t need to say very much to convince you of the black-hat uses of learning your identity. I’ve already talked about how a phishing site that knows who you are can deliver a customized page that is dramatically more effective. Or imagine the potential for surveillance — with the cooperation of a single ad network, a Government can put a de-anonymization script on millions of websites and keep tabs on every click anyone makes. In fact, you only need to be de-anonymized once; regular tracking scripts will do the job after that.

Grey hat. But I want to argue here that the grey hat use case is far more likely/common than the black hat. For example, here’s an article arguing that websites should sniff their visitors’ history for a “better user experience.” The nonchalant way in which the author talks about exploiting a nasty bug and the lack of mention of any privacy concerns is both scary and amusing. In the comments section of that article you can find links to implementations. In fact there’s even a website selling history sniffing code that website owners can drop into their site.

Shades of grey. Consider a thought experiment. Suppose a website delivered a “better user experience” by sniffing your history, but didn’t send that information back to the server. Whatever web page customization happens is done purely in the browser using Javascript. Is that unethical? If you think it’s unethical, what about if the site popped up a box to get the user’s consent before doing so? Remember that 80% of users are going to click OK without understanding what the box says. At this point it’s looking pretty close to Adnostic, a paper/project I’ve been working on as a privacy enhancing tool.

My point here is not to defend history stealing. Rather, I hope I’ve convinced you that there’s a gentle gradient between white and black hat, at least in terms of intent, and that it’s hard to condemn someone unequivocally.

Incentive. For the most part, people who are using history sniffing “in the wild” are just trying to make an extra buck on their website through advertising. This is an extremely powerful incentive. You may not know how terrible ad targeting currently is on the web. You can find any number of horror stories like this one from Stack Overflow that says a million pageviews a day aren’t enough to pay one person part time. Anything that improves ad rates directly impacts the bottom line.

Now consider this:

The future of Internet ad targeting may lie in combining online and offline behavioral data. Several Web networks have already formed relationships with, or purchased, offline database companies. AdForce has a relationship with Experion, which has an offline database of about 120 million households in North America; likewise, DoubleClick purchased Abacus Direct, a shared catalog database with information on over 90 million U.S. households. 24/7 Media has also formed an alliance to link online and offline data.

Linking online and offline data means one thing: being able to not only track users online but also identify them. Hundreds of millions of dollars say this is going to happen one way or the other.

Some grey hat use cases. The “improved user experience” article linked above advocates history stealing for picking the right third party service providers to direct the user to by detecting which one they are already using – the right RSS reader, social bookmarking site, federated identity provider, mapping service, etc. But let’s talk about identity stealing instead of just history stealing.

Ad targeting, which I’ve already mentioned, can be improved not just by combining online with offline data but also by combining social network profile data with click tracking data. This may already be happening on some social networking sites, but identity stealing makes it possible to grab the user’s social network profile information no matter which site they’re on.

As I pointed out earlier, users are more likely to fall for phishing when the site addresses them by name. But this effect is not in any way specific to phishing. Any new site that wants to get users to try their service or to stick around longer can benefit from this technique to improve trust. Marketers have long absorbed Dale Carnegie’s wisdom that the sweetest word you can say to a person is their own name.

Grey hat is more worrisome than black hat. There are two reasons to worry about grey hat more than black hat. Every website that doesn’t have a reputation to lose is a potential user of grey hat techniques, whether history stealing or anything else. Second, grey hats are typically not using it for anything illegal (unlike phishers), which means you can’t use the law to shut them down.

This is a general thought that I want to leave computer security researchers. We are used to thinking of adversaries as malicious agents; this thinking has been reinforced by the fact that in the last decade or two, hacking went from harmless pranks to organized crime. But the nature of the adversary who exploits privacy flaws is very different from the case of data security breaches. It is important to keep this distinction in mind to be able to develop effective responses.

The role of the browser. In the next article, I will take a broader look at identity and anonymity on the Web, and discuss the role that browsers are going to play in dictating the default level of identity in the years to come.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

3 comments March 9, 2010

Previous Posts


Me, elsewhere

Get notified

Be notified when there's a new post — subscribe to the feed, follow me on twitter or use the email subscription box below.

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Recent comments

Tags

a-rod academia aggregation algorithm algorithms anonymity author recognition censorship conference data de-anonymization DNA DNA profiling eccentricity entropy ethics facebook forensics free speech FTC genome Google google buzz Google docs graph isomorphism history stealing Internet k-anonymity law lending club livejournal location meta netflix privacy privacy by design privacy policy re-identification social network analysis social networks stylometry theory ubercookies web browsers web security