De-anonymizing the Internet

January 15, 2009 at 3:16 am 21 comments

I’ve been thinking about this problem for quite a while: is it possible to de-anonymize text that is posted anonymously on the Internet by matching the writing style with other Web pages/posts where the authorship is known? I’ve discussed this with many privacy researchers but until recently had never written anything down. When someone asked essentially the same question on Hacker News, I barfed up a stream of thought on the subject :-) Here it is, lightly edited.

Each one of us has a writing style that is idiosyncratic enough to have a unique “fingerprint”. However, it is an open question whether it can be efficiently extracted.

The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in differing proportions. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
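To make the marker idea concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not a method from the literature: only the since/because pair comes from the discussion above, the other pairs are hypothetical, and the nearest-profile matching by Euclidean distance is just the simplest thing that could work.

```python
import math
import re
from collections import Counter

# Hypothetical marker pairs; only since/because is taken from the post.
MARKER_PAIRS = [("since", "because"), ("while", "whilst"), ("on", "upon")]


def marker_profile(text, pairs=MARKER_PAIRS, smoothing=1.0):
    """For each pair, return the smoothed share of uses that go to the first word."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    profile = []
    for first, second in pairs:
        a = counts[first] + smoothing   # smoothing avoids division by zero
        b = counts[second] + smoothing  # and tempers very small counts
        profile.append(a / (a + b))
    return profile


def profile_distance(p, q):
    """Euclidean distance between two marker profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def closest_author(anonymous_text, known_texts):
    """Return the known author whose profile is nearest to the anonymous text's."""
    target = marker_profile(anonymous_text)
    return min(
        known_texts,
        key=lambda name: profile_distance(target, marker_profile(known_texts[name])),
    )
```

With a handful of markers and short texts this will be very noisy; the point is only that each pair contributes a small amount of evidence and the profile is the collection of all of them.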

Modern, rigorous research in this field began with Mosteller and Wallace in 1964: they identified the author of the disputed Federalist Papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other “coups” for writing-style de-anonymization are the identification of the author of Primary Colors, as well as the Unabomber (his brother recognized his style; it wasn’t done by statistical/computational means).

The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then. I’m familiar with the results from those papers; the curious thing is that they stop at corpuses of a couple hundred authors or so, i.e., identifying one anonymous poster out of, say, 200 rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need is logarithmic in the number of potential authors, so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.
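The logarithmic scaling is easy to check with a few lines of Python. This is a back-of-the-envelope sketch: `bits_per_marker` is a hypothetical figure (the only guidance above is "typically under 1 bit"), and treating markers as independent is an idealization that real text will not satisfy.

```python
import math


def bits_needed(num_authors):
    """Bits of information required to single out one author among num_authors."""
    return math.log2(num_authors)


def markers_needed(num_authors, bits_per_marker):
    """Idealized marker count, assuming independent markers of equal strength."""
    return math.ceil(bits_needed(num_authors) / bits_per_marker)


for n in (200, 20_000, 1_000_000):
    print(n, round(bits_needed(n), 1), markers_needed(n, bits_per_marker=0.5))
```

Singling out one author among 200 takes about 7.6 bits, while 20,000 authors takes about 14.3 bits: a hundredfold larger pool needs less than twice the information.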

The other interesting thing is that the papers are fixated on ‘topic-free’ identification, where the texts aren’t about a particular topic, making the problem harder. The good news is that when you’re doing this at Internet scale, nobody is stopping you from using topic information, making it a lot easier.

So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you’d need fairly long texts, perhaps a page or two. It’s doubtful that anything can be done with a single average-length email.

Another potential de-anonymization strategy is typing-pattern fingerprinting (keystroke dynamics), i.e., analyzing the timing between keystrokes (yes, this works even for non-touch typists). This is already used in commercial products as an additional factor in password authentication. However, the implications for de-anonymization have not been explored, and I think it’s very, very feasible. For example, if Google were to insert JavaScript into Gmail to fingerprint you when you were logged in, they could use the same JavaScript to identify you on any web page where you type in text, even if you don’t identify yourself. Now think about the de-anonymization possibilities you can get by combining analysis of writing style and keystroke dynamics…
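For a rough sense of what a keystroke-dynamics fingerprint might look like, here is a small Python sketch. Everything in it is an assumption made for illustration: the input format (a list of key and press-time pairs in milliseconds), the use of mean latency per pair of consecutive keys, and the distance measure. Commercial products presumably use richer features.

```python
from statistics import mean


def digraph_latencies(events):
    """Map each consecutive key pair to its mean press-to-press latency.

    events: list of (key, press_time_ms) tuples in typing order (assumed format).
    """
    buckets = {}
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        buckets.setdefault((k1, k2), []).append(t2 - t1)
    return {pair: mean(times) for pair, times in buckets.items()}


def timing_distance(profile_a, profile_b):
    """Mean absolute latency difference over key pairs present in both profiles.

    Returns None when the two profiles share no key pairs.
    """
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return None
    return mean(abs(profile_a[p] - profile_b[p]) for p in shared)
```

A profile recorded while someone is logged in could then be compared against timings captured on another page; a small distance over many shared key pairs would be evidence that the same person is typing.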

By the way, make no mistake: the malicious uses of this far overwhelm the benevolent uses. Once this technology becomes available, it will be very hard to post anonymously at all. Think of the consequences for political dissent or whistleblowers. The Great Firewall of China could simply insert a piece of JavaScript into every web page, and poof, there goes the anonymity of everyone in China.

I think it’s likely that one can build a tool to protect anonymity by taking a chunk of writing and removing your fingerprint from it, but it will need a lot of work, and will probably lead to a cat-and-mouse game between improved de-anonymization and obfuscation techniques. Note the caveats, however. First, most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts (rants about employers, accounts from cheating spouses, political dissent, etc.) that have already been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.
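As a toy illustration of what "removing your fingerprint" could mean at the very simplest level, the Python sketch below rewrites a few interchangeable word variants to one canonical form, so those particular markers carry no information. The variant table is hypothetical, and this is nowhere near a real obfuscation tool: it leaves every marker outside the table untouched.

```python
import re

# Hypothetical variant table: each key is rewritten to its canonical form.
CANONICAL = {"whilst": "while", "amongst": "among", "corpuses": "corpora"}


def normalize_markers(text, table=CANONICAL):
    """Rewrite every listed variant to its canonical form, keeping initial capitals."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, table)) + r")\b", re.IGNORECASE
    )

    def swap(match):
        word = match.group(0)
        replacement = table[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    return pattern.sub(swap, text)
```

For example, `normalize_markers("Whilst reading the corpuses")` gives `"While reading the corpora"`. Pairs like since/because are harder, because the words are not always interchangeable, which is part of why a real tool would need a lot of work.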

If you’re interested in working on this problem, either writing style analysis for breaking anonymity or obfuscation techniques for protecting anonymity, drop me a line.


21 Comments

  • 1. Ilya  |  January 15, 2009 at 5:42 pm

    A fingerprint associated with this post is “corpuses” (vs “corpora”). Apparently it has good predictive ability – 6 million vs 5 million G-hits.

    BTW, I am surprised you haven’t enabled OpenID login. Wanna blog about it?

  • 2. Arvind  |  January 15, 2009 at 6:30 pm

    Ha ha, corpora! I had no idea that was the plural :-) Did you mean 60k rather than 6M? That’s what I’m getting.

    I would love to enable OpenID but I don’t think it’s possible. This is hosted on wordpress.com, not on my servers.

  • 3. Ilya  |  January 15, 2009 at 6:35 pm

    Hmmm. I am consistently seeing 6M for both corpora and corpuses on Google.

    You may want to check this one out—http://wordpress.org/extend/plugins/openid/

  • 4. Arvind  |  January 15, 2009 at 6:44 pm

    Seriously weird!

    Re. openid, that’s what I meant. I don’t control the wordpress install, I can’t add plugins.

  • 5. Ilya  |  January 15, 2009 at 7:06 pm

    This is what I am seeing: 6M+!

  • 6. Hoeteck  |  January 19, 2009 at 7:12 pm

    How about deanonymizing anonymous reviews and/or anonymous submissions? Here, you certainly get substantial leverage from topics. There’s also the call for papers to help along for the reviews, and readily accessible writing samples for submissions.

  • 7. Arvind  |  January 19, 2009 at 7:27 pm

    Indeed. There was one highly negative review of our Netflix paper when we first submitted it, which we de-anonymized right away using these techniques. It was very ironic and hilarious.

    This is part of the reason I’m not a fan of anonymous submissions. In my experience, people who’ve been working in a field for a long time can often tell at a glance who the authors of a submitted paper are.

    I’ve seen a cute paper that shows how to de-anonymize reviewers in a really clever way. I can’t talk about it because I’m not sure if it’s public yet.

    • 8. Anonymous Rex  |  January 6, 2010 at 1:48 am

      Hi Arvind,

      First of all, I enjoy both your Livejournal and this blog. Thanks for the interesting thoughts.

      Since a year or so has passed since this comment of yours, can you say something about this “cute paper” that de-anonymizes reviewers “in a really clever way”?

      Thanks!

  • 11. Ray  |  January 19, 2009 at 9:07 pm

    I wonder if everybody has that unique a writing style. Highly prolific authors are probably distinguishable, but there’s the broad portion of the population that, while formally literate, doesn’t do much writing. Can you distinguish between people writing grade school level sentences (with an extremely limited vocabulary) at a rate of, say, 1000 words a year?

  • 12. Arvind  |  January 19, 2009 at 9:25 pm

    Oh, don’t confuse “writing style,” used as a technical term here, with literary style. People who write grade-school level sentences are actually much better candidates–they make idiosyncratic spelling errors. As an information theoretic statement, the fact that everyone has a unique writing style is incontrovertible. It assumes that samples of unlimited size are available and the matching algorithm has infinite computational resources.

    The availability of text is an entirely separate issue, not to be confused with writing style. I did mention in my post that you’d need a page or two of writing at least to be able to match writing samples with any measure of reliability.

    Edit. I do acknowledge that there is a point beyond which decreasing language proficiency negatively impacts fingerprinting, but it’s an extremely low threshold.

  • 13. Is Anonymity Research Ethical? « 33 Bits of Entropy  |  April 9, 2009 at 8:53 pm

    […] researcher who is working on writing style identification (“stylometry”), after reading my post on related de-anonymization techniques, wonders what the positive impact of such research could be, […]

  • […] on some (currently exploratory) investigations into authorship recognition (see my post on De-anonymizing the Internet). We’ve been wondering why progress in existing papers seems to hit a wall at around 100 […]

  • 15. Jodi Schneider  |  November 26, 2009 at 8:35 am

    “The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then.” This bibliography seems to miss the literary side of author identification (just scanning, but I don’t see, for instance, the computer analyses of Shakespeare from UMass/Renaissance Center).

    • 16. Arvind  |  November 26, 2009 at 8:42 am

      Do you have a link that explains what you’re referring to?

  • 17. Onymous.  |  September 22, 2010 at 8:50 pm

    On the comment in the linked site, you write: “My main interest is to write a paper and possibly build tools to take a chunk of writing and try to remove your fingerprint from it, i.e, protect anonymity”

    Interesting project. Did you ever get anywhere? Did anyone ever contact you, expressing an interest?

    • 18. Arvind  |  September 22, 2010 at 9:33 pm

      Happily, yes. I started working on it with researchers at UC Berkeley, although the project has been on hold for a while now.

  • 19. Anne Ominous  |  April 14, 2011 at 8:05 am

    This whole idea revolves around entries by a relatively small group of people writing about limited subjects. The scope is far too narrow to be very significant, much less impressive.

    I am not very impressed. If I solicited a treatise on some specific subject from 200 colleagues, who had written known papers about that very subject… it is not very damned surprising that they can be distinguished by their writings.

    Quote: “In my experience, people who’ve been working in a field for a long time can often tell at a glance who the authors of a submitted paper are.”

    Sure. And experienced poker players who know their opponents can often tell if they are bluffing. And software for that has been designed, too.

    But in neither case, AFAIK, has software been shown to be able to identify individuals from a large crowd of random people, which is really what this would need to make it noteworthy.

    Um… nothing personal intended… but what I am saying is that I do not believe this is noteworthy.

    • 20. Arvind Narayanan  |  April 14, 2011 at 12:54 pm

      Boy do my colleagues and I have a surprise in store for you! Stay tuned.

      • 21. Anne Ominous  |  April 20, 2014 at 5:05 pm

        Not so.

        I am not at all surprised that this could be a viable technique. My point was that the evidence as presented on this page does not appear to strongly establish that viability.

        That is what I wrote, that is what I meant.



About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
