De-anonymizing the Internet

January 15, 2009

I’ve been thinking about this problem for quite a while: is it possible to de-anonymize text that is posted anonymously on the Internet by matching the writing style with other Web pages/posts where the authorship is known? I’ve discussed this with many privacy researchers but until recently had never written anything down. When someone asked essentially the same question on Hacker News, I barfed up a stream of thought on the subject :-) Here it is, lightly edited.

Each one of us has a writing style that is idiosyncratic enough to have a unique “fingerprint”. However, it is an open question whether it can be efficiently extracted.

The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
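To make the marker idea concrete, here is a minimal sketch in Python. Only the ‘since’/‘because’ pair comes from the text above; the other pairs, the function names, and the smoothing constant are assumptions made purely for illustration, and a real system would need far more markers than this.

```python
import re
from collections import Counter

# Hypothetical marker pairs of near-interchangeable words.
# Only ("since", "because") is taken from the post; the rest are illustrative.
MARKER_PAIRS = [("since", "because"), ("while", "whilst"), ("although", "though")]


def marker_profile(text, pairs=MARKER_PAIRS, smoothing=1.0):
    """For each pair (a, b), return the smoothed fraction of uses that are `a`.

    Smoothing keeps the ratio defined (0.5) when neither word appears.
    """
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    profile = []
    for a, b in pairs:
        na, nb = counts[a], counts[b]
        profile.append((na + smoothing) / (na + nb + 2 * smoothing))
    return profile


def profile_distance(p, q):
    """Sum of absolute differences between two profiles (smaller = more alike)."""
    return sum(abs(x - y) for x, y in zip(p, q))
```

To guess the author of an anonymous text, you would compute its profile and pick the known author whose profile is closest. Each pair contributes very little on its own, which is why many markers are needed.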

Modern, rigorous research in this field began with Mosteller and Wallace in 1964: they identified the author of the disputed Federalist Papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other “coups” for writing-style de-anonymization are the identification of the author of Primary Colors, as well as the Unabomber (his brother recognized his style; it wasn’t done by statistical/computational means).

The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then. I’m familiar with the results from those papers; the curious thing is that they stop at corpuses of a couple hundred authors or so, i.e., identifying one anonymous poster out of say 200, rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need is always logarithmic in the potential number of authors, and so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.
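The logarithmic claim is easy to check with a few lines of Python. This is a back-of-the-envelope sketch: the function names are mine, and the half-bit-per-marker figure and the assumption that markers are independent are idealizations, not measured values.

```python
import math


def bits_needed(num_authors):
    """Bits of information required to single out one author among num_authors."""
    return math.log2(num_authors)


def markers_needed(num_authors, bits_per_marker=0.5):
    """Rough marker count, ASSUMING independent markers of equal strength.

    Real markers are correlated, so treat this as an optimistic lower bound.
    """
    return math.ceil(bits_needed(num_authors) / bits_per_marker)


for n in (200, 20_000, 1_000_000):
    print(f"{n:,} authors: {bits_needed(n):.1f} bits, "
          f"about {markers_needed(n)} markers at 0.5 bits each")
```

Going from 200 authors to a million takes you from roughly 7.6 bits to roughly 19.9 bits, i.e., less than three times as much information for a 5000-fold larger pool.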

The other interesting thing is that the papers are fixated on ‘topic-free’ identification, where the texts aren’t about a particular topic, making the problem harder. The good news is that when you’re doing this at Internet scale, nobody is stopping you from using topic information, making it a lot easier.

So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you’d need fairly long texts, perhaps a page or two. It’s doubtful that anything can be done with a single average-length email.

Another potential de-anonymization strategy is typing-pattern fingerprinting (keystroke dynamics), i.e., analyzing the timing between our keystrokes (yes, this works even for non-touch typists). This is already used in commercial products as an additional factor in password authentication. However, the implications for de-anonymization have not been explored, and I think it’s very, very feasible. For example, if Google were to insert JavaScript into Gmail to fingerprint you while you were logged in, they could use the same JavaScript to identify you on any web page where you type in text, even if you don’t identify yourself. Now think about the de-anonymization possibilities you can get by combining analysis of writing style and keystroke dynamics…
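Here is a rough sketch of what a timing fingerprint could look like, assuming the key-press timestamps have already been captured (in a browser that capture would be done in JavaScript; the analysis is shown in Python). The event format, the function names, and the choice of mean latency per key pair are all illustrative assumptions, not a description of how any commercial product works.

```python
from collections import defaultdict
from statistics import mean


def digraph_latencies(events):
    """Build a timing profile from key-press events.

    events: list of (key, press_time_ms) tuples in typing order (assumed format).
    Returns {(key1, key2): mean delay in ms between pressing key1 and key2}.
    """
    samples = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        samples[(k1, k2)].append(t2 - t1)
    return {pair: mean(times) for pair, times in samples.items()}


def timing_distance(profile_a, profile_b):
    """Mean absolute latency difference over key pairs present in both profiles."""
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return float("inf")  # nothing in common, so no basis for comparison
    return mean(abs(profile_a[p] - profile_b[p]) for p in shared)
```

A profile recorded while you are logged in could then be compared against one recorded from an unidentified typist; a small distance suggests, but does not prove, the same person.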

By the way, make no mistake: the malicious uses of this far overwhelm the benevolent uses. Once this technology becomes available, it will be very hard to post anonymously at all. Think of the consequences for political dissent or whistleblowers. The Great Firewall of China could simply insert a piece of JavaScript into every web page, and poof, there goes the anonymity of everyone in China.

I think it’s likely that one can build a tool to protect anonymity by taking a chunk of writing and removing your fingerprint from it, but it will need a lot of work, and will probably lead to a cat-and-mouse game between improved de-anonymization and obfuscation techniques. Note the caveats, however. First, most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts (rants about employers, accounts from cheating spouses, political dissent, etc.) that have already been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.

If you’re interested in working on this problem–either writing style analysis for breaking anonymity or obfuscation techniques for protecting anonymity–drop me a line.



16 Comments

  • 1. Ilya  |  January 15, 2009 at 5:42 pm

    A fingerprint associated with this post is “corpuses” (vs “corpora”). Apparently it has good predictive ability – 6 million vs 5 million G-hits.

    BTW, I am surprised you haven’t enabled OpenID login. Wanna blog about it?

  • 2. Arvind  |  January 15, 2009 at 6:30 pm

    Ha ha, corpora! I had no idea that was the plural :-) Did you mean 60k rather than 6M? That’s what I’m getting.

    I would love to enable OpenID but I don’t think it’s possible. This is hosted on wordpress.com, not on my servers.

  • 3. Ilya  |  January 15, 2009 at 6:35 pm

    Hmmm. I am consistently seeing 6M for both corpora and corpuses on Google.

    You may want to check this one out—http://wordpress.org/extend/plugins/openid/

  • 4. Arvind  |  January 15, 2009 at 6:44 pm

    Seriously weird!

    Re. openid, that’s what I meant. I don’t control the wordpress install, I can’t add plugins.

  • 5. Ilya  |  January 15, 2009 at 7:06 pm

    This is what I am seeing: 6M+!

  • 6. Hoeteck  |  January 19, 2009 at 7:12 pm

    How about deanonymizing anonymous reviews and/or anonymous submissions? Here, you certainly get substantial leverage from topics. There’s also the call for papers to help along for the reviews, and readily accessible writing samples for submissions.

  • 7. Arvind  |  January 19, 2009 at 7:27 pm

    Indeed. There was one highly negative review of our Netflix paper when we first submitted it, which we de-anonymized right away using these techniques. It was very ironic and hilarious.

    This is part of the reason I’m not a fan of anonymous submissions. In my experience, people who’ve been working in a field for a long time can often tell at a glance who the authors of a submitted paper are.

    I’ve seen a cute paper that shows how to de-anonymize reviewers in a really clever way. I can’t talk about it because I’m not sure if it’s public yet.

    • 8. Anonymous Rex  |  January 6, 2010 at 1:48 am

      Hi Arvind,

      First of all, I enjoy both your Livejournal and this blog. Thanks for the interesting thoughts.

      Since a year or so has passed since this comment of yours, can you say something about this “cute paper” that de-anonymizes reviewers “in a really clever way”?

      Thanks!

  • 11. Ray  |  January 19, 2009 at 9:07 pm

    I wonder if everybody has that unique a writing style. Highly prolific authors are probably distinguishable, but there’s the broad portion of the population that, while formally literate, doesn’t do much writing. Can you distinguish between people writing grade school level sentences (with an extremely limited vocabulary) at a rate of, say, 1000 words a year?

  • 12. Arvind  |  January 19, 2009 at 9:25 pm

    Oh, don’t confuse “writing style,” used as a technical term here, with literary style. People who write grade-school level sentences are actually much better candidates–they make idiosyncratic spelling errors. As an information theoretic statement, the fact that everyone has a unique writing style is incontrovertible. It assumes that samples of unlimited size are available and the matching algorithm has infinite computational resources.

    The availability of text is an entirely separate issue, not to be confused with writing style. I did mention in my post that you’d need a page or two of writing at least to be able to match writing samples with any measure of reliability.

    Edit. I do acknowledge that there is a point beyond which decreasing language proficiency negatively impacts fingerprinting, but it’s an extremely low threshold.

  • 13. Is Anonymity Research Ethical? « 33 Bits of Entropy  |  April 9, 2009 at 8:53 pm

    [...] researcher who is working on writing style identification (“stylometry”), after reading my post on related de-anonymization techniques, wonders what the positive impact of such research could be, [...]

  • [...] on some (currently exploratory) investigations into authorship recognition (see my post on De-anonymizing the Internet). We’ve been wondering why progress in existing papers seems to hit a wall at around 100 [...]

  • 15. Jodi Schneider  |  November 26, 2009 at 8:35 am

    “The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then.” This bibliography seems to miss the literary side of author identification (just scanning, but I don’t see, for instance, the computer analyses of Shakespeare from UMass/Renaissance Center).

    • 16. Arvind  |  November 26, 2009 at 8:42 am

      Do you have a link that explains what you’re referring to?

