Posts Tagged Internet

Women in Tech: How Anonymity Contributes to the Problem

Like Michael Arrington, I too have sat on the sidelines of the debate on women in tech. Unlike Michael Arrington, I did so because nobody asked for my opinion. There is, however, one aspect of the debate that I’m qualified to comment on.

The central issue seems to be whether the low participation rate of women in technology is due to a hostile environment in the tech industry (e.g., sexism, overt or covert) or due to external factors, whether genetic or social, that influence women to pick career paths other than technology without even giving it a shot.

Arrington thinks it’s the latter, and makes a strong case for his position. In response, many have pointed out various behaviors common in the tech industry that make it unappealing to women. Jessica B. Hamrick talks about rampant elitism, which affects women disproportionately. What I’m more interested in today is Michelle Greer’s account of being viciously attacked for a relatively innocuous comment on Arrington’s post.

Let me come right out and say it: while I am a defender of the right to anonymous speech, I believe it has no place whatsoever in the vast majority of discussion forums. The reason is simple: there is something about anonymity that completely dismantles our evolved social norms and civility and makes us behave like apes. Not all of us, to be sure, but it only takes a few to ruin it for everyone.

There is no doubt that sexist comments online — the vast majority of them anonymous — contribute hugely to the problem of tech being a hostile environment for women. While there are rude comments directed at everyone, just look around if you need convincing that the ones that attack someone specifically for being female tend to be much more depraved. It is also true that rude behavior online is not limited to tech fields, but it creates more of a barrier there because online participation is essential for being relevant.

Here’s my suggestion to everyone who’d like to do something to make tech less hostile to women: perhaps the best return on your time comes from making anonymous, unmoderated comments a thing of the past. Abolish them on your own sites, and write to other site admins to educate them about the importance of this issue. And when you see an uncivil comment, either educate or ignore the person, but try not to get enraged; you’d be feeding the troll.

Thanks to Ann Kilzer for reviewing a draft.


4 comments August 30, 2010

De-anonymizing the Internet

I’ve been thinking about this problem for quite a while: is it possible to de-anonymize text that is posted anonymously on the Internet by matching the writing style with other Web pages or posts where the authorship is known? I’ve discussed this with many privacy researchers, but until recently I had never written anything down. When someone asked essentially the same question on Hacker News, I barfed up a stream of thought on the subject :-) Here it is, lightly edited.

Each one of us has a writing style that is idiosyncratic enough to have a unique “fingerprint”. However, it is an open question whether it can be efficiently extracted.

The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in different proportions. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
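To make the marker idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a method taken from the research literature: only the ‘since’/‘because’ pair comes from the text above, the other pairs are made up for the example, and the names fingerprint and distance are hypothetical.

```python
import re
from collections import Counter

# Pairs of near-interchangeable words. Only ("since", "because") is from
# the post; the other two pairs are assumptions added for illustration.
MARKER_PAIRS = [("since", "because"), ("while", "whereas"), ("but", "however")]

def fingerprint(text):
    """For each marker pair, return the fraction of uses that went to the
    first word of the pair, or None if neither word appears in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    profile = []
    for first, second in MARKER_PAIRS:
        total = counts[first] + counts[second]
        profile.append(counts[first] / total if total else None)
    return profile

def distance(p, q):
    """Mean absolute difference over the markers observed in both profiles.
    Returns infinity when the two profiles share no observed marker."""
    shared = [(x, y) for x, y in zip(p, q) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sum(abs(x - y) for x, y in shared) / len(shared)
```

To use it, you would compute fingerprint() for the anonymous text and for each candidate author's known writing, then rank the candidates by distance(). A real system would need far more markers and a proper statistical model; a short text will leave most markers unobserved.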

Modern, rigorous research in this field began with Mosteller and Wallace in 1964: they identified the author of the disputed Federalist papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other “coups” for writing-style de-anonymization are the identification of the author of Primary Colors and of the Unabomber (although in the latter case his brother recognized his style; it wasn’t done by statistical or computational means).

The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then. I’m familiar with the results from those papers; the curious thing is that they stop at corpora of a couple hundred authors or so, i.e., identifying one anonymous poster out of, say, 200 rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need to single out one author is logarithmic in the number of potential authors (about 8 bits for 200 candidates, about 20 bits for a million), so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.
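The logarithmic claim is easy to check with a few lines of Python. This is a back-of-the-envelope sketch: the half-bit-per-marker figure is an assumption I picked for illustration (the only thing said earlier is "typically under 1 bit"), and it optimistically treats markers as independent, which real ones are not.

```python
import math

def bits_needed(num_authors):
    """Bits of information required to single out one author among
    num_authors equally likely candidates."""
    return math.log2(num_authors)

def markers_needed(num_authors, bits_per_marker=0.5):
    """Rough count of markers required, ASSUMING each marker contributes
    bits_per_marker independent bits (0.5 is an illustrative guess)."""
    return math.ceil(bits_needed(num_authors) / bits_per_marker)

for n in (200, 50_000, 1_000_000):
    print(n, round(bits_needed(n), 1), markers_needed(n))
```

Going from 200 candidates to a million multiplies the candidate pool by 5,000 but only takes the requirement from roughly 8 bits to roughly 20, which is why scaling up looks plausible.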

The other interesting thing is that the papers are fixated on ‘topic-free’ identification, where the texts aren’t about a particular topic, making the problem harder. The good news is that when you’re doing this at Internet scale, nobody is stopping you from using topic information, making it a lot easier.

So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you’d need fairly long texts, perhaps a page or two. It’s doubtful that anything can be done with a single average-length email.

Another potential de-anonymization strategy is to use typing pattern fingerprinting (keystroke dynamics), i.e., analyzing the timing between our keystrokes (yes, this works even for non-touch typists). This is already used in commercial products as an additional factor in password authentication. However, the implications for de-anonymization have not been explored, and I think it’s very, very feasible. For example, if Google were to insert JavaScript into Gmail to fingerprint you when you were logged in, they could use the same JavaScript to identify you on any web page where you type in text, even if you don’t identify yourself. Now think about the de-anonymization possibilities you can get by combining analysis of writing style and keystroke dynamics…
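Here is a rough sketch of what a keystroke-timing fingerprint might look like, in Python for readability. It is not how any commercial product works, as far as I know; the input format (a list of key and press-time pairs), the choice of key-pair latencies as the feature, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def digraph_profile(events):
    """events: list of (key, press_time_ms) tuples in typing order
    (a hypothetical input format). Returns the mean latency between
    consecutive key presses, grouped by key pair."""
    latencies = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        latencies[(k1, k2)].append(t2 - t1)
    return {pair: mean(times) for pair, times in latencies.items()}

def profile_distance(p, q):
    """Mean absolute latency difference over key pairs seen in both
    profiles; infinity if the profiles have no key pair in common."""
    shared = p.keys() & q.keys()
    if not shared:
        return float("inf")
    return mean(abs(p[pair] - q[pair]) for pair in shared)
```

A profile recorded while someone is logged in could then be compared, via profile_distance(), with a profile recorded on a page where they typed without identifying themselves. A serious attempt would also look at how long each key is held down and at the variance of the timings, not just the means.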

By the way, make no mistake: the malicious uses of this far overwhelm the benevolent uses. Once this technology becomes available, it will be very hard to post anonymously at all. Think of the consequences for political dissent or whistleblowers. The Great Firewall of China could simply insert a piece of JavaScript into every web page, and poof, there goes the anonymity of everyone in China.

I think it’s likely that one can build a tool to protect anonymity by taking a chunk of writing and removing your fingerprint from it, but it will need a lot of work, and will probably lead to a cat-and-mouse game between improved de-anonymization and obfuscation techniques. Note the caveats, however. First, most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts (rants about employers, accounts from cheating spouses, political dissent, etc.) that have already been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.

If you’re interested in working on this problem, whether writing style analysis for breaking anonymity or obfuscation techniques for protecting anonymity, drop me a line.

16 comments January 15, 2009

