Posts tagged ‘anonymity’

Cookies, Supercookies and Ubercookies: Stealing the Identity of Web Visitors

Synopsis. Highly sticky techniques for tracking web visitors, called supercookies, are becoming well known. But the state of the art has in fact advanced beyond tracking, allowing a website to learn the identity of a visitor. I will call these techniques ubercookies; this article describes one such recently discovered technique. Future articles in this series will describe even more powerful variants and discuss the implications.

Cookies. Most people are aware that their web browsing activity over time and across sites can be tracked using cookies. When you are being tracked, it can be deduced that the same person visited certain sites at certain times, but the sites doing the tracking don’t know who you are, i.e., your name and other real-world details, unless you choose to tell them in some way, such as by logging in.

Cookies are easy to delete, and so there’s been a big impetus in the Internet advertising industry to discover and deploy more robust tracking mechanisms.

Supercookies. You may be surprised to find just how helpless a user is against a site (or more usually, a network of sites) that is truly determined to track them. There are Flash cookies, much harder to delete, some of which respawn the regular HTTP cookies that you delete. The EFF’s Panopticlick project demonstrates many more sophisticated “browser fingerprinting” methods. (Jonathan Mayer’s senior thesis contained a smaller-scale demonstration of some of those techniques.)

A major underlying reason for a lot of these problems is that any browser feature that allows a website to store “state” on the client can be abused for tracking, and there are a bewildering variety of these. There is a great analysis in a paper by my Stanford colleagues. One of the points they make is that co-operative tracking by websites is essentially impossible to defend against.

Ubercookies: history stealing. Now let’s get to the scary stuff: uncovering identity. History stealing or history sniffing is an unintended consequence of the way the web is designed; it allows a website to learn which URLs you’ve been to. While a site can’t simply ask your browser for a list of visited URLs, it can ask “yes/no” questions and your browser will faithfully respond. The most common way of doing this is by injecting invisible links into the page using Javascript and exploiting the fact that the CSS link color attribute depends on whether the link has been visited or not.

History stealing has been known for a decade, and browser vendors have failed to fix it because it cannot be fixed without sacrificing some useful functionality (the crude way is to turn off visited link coloring altogether; a subtler solution is SafeHistory). Increasingly worse consequences have been discovered over the years: for example, a malicious site can learn which bank you use and customize a phishing page accordingly. But a paper (full text, PDF) coming out at this year’s IEEE S&P conference at Oakland takes it to a new level.

Identity. Let’s pause for a second and think about what finding your identity means. In the modern, social web, social network accounts have become our de-facto online identities, and most people reveal their name and at least some other real-world information about themselves on their profiles. So if the attacker can discover the URL of your social network profile, we can agree that he has identified you for all practical purposes. And the new paper shows how to do just that.

The attack relies on the following observations:

  1. Almost all social networking sites have some kind of “group” functionality: users can add themselves to groups.
  2. Users typically add themselves to multiple groups, at least some of which are public.
  3. Group affiliations, just like your movie-watching history and many other types of attributes, are sufficient to fingerprint a user. There’s a high chance there’s no one else who belongs to the same set of groups that you do (or is even close). [Aside: I used this fact to show that Lending Club data can be de-anonymized.]
  4. Users who belong to a group are likely to visit group-specific URLs that are predictable.

Put the above facts together, and the attack emerges: the attacker (an arbitrary website you visit, without the co-operation of whichever social network is used as an attack enabler) uses history stealing to test a bunch of group-related URLs one by one until he finds a few (public) groups that the anonymous user probably belongs to. The attacker has already crawled the social network, and therefore knows which user belongs to which groups. Now he puts two and two together: using the list of groups he got from the browser, he does a search on the backend to find the (usually unique) user who belongs to all those groups.

Needless to say, this is a somewhat simplified description. The algorithm can easily be modified so that it works even if some of the groups have disappeared from your history (say, because you clear it once in a while) or if you’ve visited groups you’re not a member of. The authors demonstrated the attack with real users on the Xing network, and also showed theoretically that it is feasible on a number of other social networks including Facebook and Myspace. It takes a few thousand Javascript queries and runs in a few seconds on modern browsers, which makes it essentially surreptitious.
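
To make the back-end lookup step concrete, here’s a minimal sketch in Python. The data structures and names (group_members, sniffed_groups) are hypothetical stand-ins for the attacker’s crawl and for the output of the history-sniffing step, and a simple vote count stands in for the paper’s more robust matching:

```python
from collections import Counter

def candidate_users(group_members, sniffed_groups):
    """Rank crawled users by how many of the sniffed groups they belong to."""
    votes = Counter()
    for group in sniffed_groups:
        for user in group_members.get(group, ()):
            votes[user] += 1
    return votes.most_common()

# Toy data: the attacker's crawl of public group memberships, and the groups
# whose URLs showed up as "visited" in the victim's browser history.
group_members = {
    "hiking":  {"alice", "bob", "carol"},
    "go-club": {"alice", "dave"},
    "jazz":    {"alice", "bob"},
}
sniffed_groups = {"hiking", "go-club", "jazz"}

print(candidate_users(group_members, sniffed_groups))
# 'alice' ranks first, matching all three sniffed groups.
```

Counting votes instead of demanding an exact intersection is what tolerates groups missing from the history, or visited groups the user doesn’t actually belong to.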

Fallout. There are only two ways to try to fix this. The first is for all the social networking sites to change their URL patterns by randomizing them so that point 4 above (predictable URL identifying that you belong to a group) is no longer true. The second is for all the browser vendors to fix their browsers so that history stealing is no longer possible.

The authors contacted several of the social networks; Xing quickly implemented the URL randomization fix, which I find surprising and impressive. Ultimately, however, Xing’s move will probably be no more than a nice gesture, for the following reason.

Over the last few days, I have been working on a stronger version of this attack which:

  • can make use of every URL in the browser history to try and identify the user. This means that server-side fixes are not possible, because literally every site on the web would need to implement randomization.
  • avoids the costly crawling step, further lowering the bar to executing the attack.

That leaves browser-based fixes for history stealing, which haven’t materialized in the 10 years that the problem has been known. Will browser vendors finally accept the functionality hit and deal with the problem? We can hope so, but it remains to be seen.

In the next article, I will describe the stronger attack and also explain in more detail why your profile page on almost any website is a very strong identifier.

Thanks to Adam Bossy for reviewing a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

February 18, 2010 at 7:49 am 28 comments

The Secret Life of Data

Some people claim that re-identification attacks don’t matter, the reasoning being: “I’m not important enough for anyone to want to invest time on learning private facts about me.” At first sight that seems like a reasonable argument, at least in the context of the re-identification algorithms I have worked on, which require considerable human and machine effort to implement.

The argument is nonetheless fallacious, because re-identification typically doesn’t happen at the level of the individual. Rather, the investment of effort yields results over the entire database of millions of people (hence the emphasis on “large-scale” or “en masse”.) On the other hand, the harm that occurs from re-identification affects individuals. This asymmetry exists because the party interested in re-identifying you and the party carrying out the re-identification are not the same.

In today’s world, the entities most interested in acquiring and de-anonymizing large databases might be data aggregation companies like ChoicePoint that sell intelligence on individuals, whereas the party interested in using the re-identified information about you would be their clients/customers: law enforcement, an employer, an insurance company, or even a former friend out to slander you.

Data passes through multiple companies or entities before reaching its destination, making it hard to prove or even detect that it originated from a de-anonymized database. There are lots of companies known to sell “anonymized” customer data: for example Practice Fusion “subsidizes its free EMRs by selling de-identified data to insurance groups, clinical researchers and pharmaceutical companies.” On the other hand, companies carrying out data aggregation/de-anonymization are a lot more secretive about it.

Another piece of the puzzle is what happens when a company goes bankrupt. Decode genetics recently did, which is particularly interesting because they are sitting on a ton of genetic data. There are privacy assurances in place in their original Terms of Service with their customers, but will that bind the new owner of the assets? These are legal gray areas, and are frequently exploited by companies looking to acquire data.

At the recent FTC privacy roundtable, Scott Taylor of Hewlett Packard said his company regularly had the problem of not being able to determine where data is being shared downstream after the first point of contact. I’m sure the same is true of other companies as well. (How then could we possibly expect third-party oversight of this process?)  Since data fuels the modern Web economy, I suspect that the process of moving data around will continue to become more common as well as more complex, with more steps in the chain. We could use a good name for it — “data laundering,” perhaps?

February 6, 2010 at 8:48 pm 1 comment

The Internet has no Delete Button: Limits of the Legal System in Protecting Anonymity

It is futile to try to stay anonymous by getting your name or data purged from the Internet, once it is already out there. Attempts at such censorship have backfired repeatedly and spectacularly, giving rise to the term Streisand effect. A recent lawsuit provides the latest demonstration: two convicted German killers (who have completed their prison sentences) are attempting to prevent Wikipedia from identifying them.

The law in Germany tries to “protect the name and likenesses of private persons from unwanted publicity.” Of course, the Wikimedia foundation is based in the United States, and this attempt runs head-on into the First Amendment, the right to Free Speech. European countries have a variety of restrictions on speech—Holocaust denial is illegal, for instance. But there is little doubt about how U.S. courts will see the issue; Jennifer Granick of the EFF has a nice write-up.

The aspect that interests me is that even if there weren’t a Free Speech issue, it would be utterly impossible for the court system to keep the names of these men from the Internet. I wonder if the German judge who awarded a judgment against the Wikimedia foundation was aware that it would achieve exactly the “unwanted publicity” that the law was intended to avoid. He would probably have ruled as he did in any case, but it is interesting to speculate.

Legislators, on the other hand, would do well to be aware of the limitations of censorship, and the need to update laws to reflect the rules of the information age. There are always alternatives, although they usually involve trade-offs. In this instance, perhaps one option is a state-supplied alternate identity, analogous to the Witness Protection Program?

Returning to the issue of enforceability, the European doctrine apparently falls under “rights of the personality,” specifically the “right to be forgotten,” according to this paper that discusses the trans-atlantic clash. I find the very name rather absurd; it reminds me of attempting not to think of an elephant (try it!)

The above paper, written from the European perspective, laments the irreconcilable differences between the two viewpoints on the issue of Free Speech vs. Privacy. However, there is no discussion of enforceability. The author does suspect, in the final paragraph, that the European doctrine will become rather meaningless due to the Internet, but he believes this to be purely a consequence of the fact that the U.S. courts have put Free Speech first.

I don’t buy it—even if the U.S. courts joined Europe in recognizing a “right to be forgotten,” it would still be essentially unenforceable. Copyright-based rather than privacy-based censorship attempts offer us a lesson here. Copyright law has international scope, due to being standardized by the WIPO, and yet the attempt to take down the AACS encryption key was pitifully unsuccessful.

Taking down a repeat offender (such as a torrent tracker) or a large file (the Windows 2000 source code leak) might be easier. But if we’re talking about a small piece of data, the only factor that seems to matter is the level of public interest in the sensitive information. The only times when censorship of individual facts has been (somewhat) successful in the face of public sentiment is within oppressive regimes with centralized Internet filters.

There are many laws, particularly privacy laws, that need to be revamped for the digital age. What might appear obvious to technologists might be much less apparent to law scholars, lawmakers and the courts. I’ve said it before on this blog, but it bears repeating: there is an acute need for greater interdisciplinary collaboration between technology and the law.

November 28, 2009 at 5:22 am Leave a comment

Oklahoma Abortion Law: Bloggers get it Wrong

The State of Oklahoma just passed legislation requiring that detailed information about every abortion performed in the state be submitted to the State Department of Health. Reports based on this data are to be made publicly available. The controversy around the law gained steam rapidly after bloggers revealed that even though names and addresses of mothers obtaining abortions were not collected, the women could nevertheless be re-identified from the published data based on a variety of other required attributes such as the date of abortion, age and race, county, etc.

As a computer scientist studying re-identification, I naturally had this brought to my attention. I was as indignant on hearing about it as the next smug Californian, and I promptly wrote up a blog post analyzing the serious risk of re-identification based on the answers to the 37 questions that each mother must anonymously report. Just before posting it, however, I decided to give the text of the law a more careful reading, and realized that the bloggers had been misinterpreting the law all along.

While it is true that the law requires submitting a detailed form to the Department of Health, the only information made public is a set of annual reports with statistical tallies of the number of abortions performed under very broad categories, which presents a negligible to non-existent re-identification risk.

I’m not defending the law; that is outside my sphere of competence. There do appear to be other serious problems with it, outlined in a lawsuit aimed at stopping the law from going into effect. The text of this complaint, as Paul Ohm notes, does not raise the “public posting” claim. Besides, the wording of the law is very ambiguous, and I can certainly see why it might have been misinterpreted.

But I do want to lament the fact that bloggers and special interest groups can start a controversy based on a careless (or less often, deliberate) misunderstanding, and have it amplified by an emerging category of news outlets like the Huffington Post, which have the credibility of blogs but a readership approaching traditional media. At this point the outrage becomes self-sustaining, and the factual inaccuracies become impossible to combat. I’m reminded of the affair of the gay sheep.

October 9, 2009 at 6:24 pm 10 comments

Your Morning Commute is Unique: On the Anonymity of Home/Work Location Pairs

Philippe Golle and Kurt Partridge of PARC have a cute paper (pdf) on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations — to a block level — identifies them uniquely.

Even if we look at the much coarser granularity of a census tract — tracts correspond roughly to ZIP codes; there are on average 1,500 people per census tract — for the average person, there are only around 20 other people who share the same home and work location. There’s more: 5% of people are uniquely identified by their home and work locations even if it is known only at the census tract level. One reason for this is that people who live and work in very different areas (say, different counties) are much more easily identifiable, as one might expect.
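
If you want to play with these numbers, here is a toy Python sketch that computes anonymity set sizes for home/work pairs. The population is synthetic and spread uniformly over tracts, which understates how many people are unique; the real commuting data the paper analyzes is far more skewed, so actual anonymity sets are smaller than this toy suggests:

```python
import random
from collections import Counter

random.seed(0)
NUM_PEOPLE = 100_000
NUM_TRACTS = 70   # roughly 1,500 people per home tract

# Synthetic (home tract, work tract) pair for each person.
population = [(random.randrange(NUM_TRACTS), random.randrange(NUM_TRACTS))
              for _ in range(NUM_PEOPLE)]

pair_counts = Counter(population)

# For each person: how many *other* people share the exact same pair?
anonymity_sets = [pair_counts[pair] - 1 for pair in population]

print("median number of others sharing your commute:",
      sorted(anonymity_sets)[NUM_PEOPLE // 2])
print("fraction of people with a unique home/work pair:",
      sum(1 for size in anonymity_sets if size == 0) / NUM_PEOPLE)
```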

The paper is timely, because Location Based Services  are proliferating rapidly. To understand the privacy threats, we need to ask the two usual questions:

  1. who has access to anonymized location data?
  2. how can they get access to auxiliary data linking people to location pairs, which they can then use to carry out re-identification?

The authors don’t say much about these questions, but that’s probably because there are too many possibilities to list! In this post I will examine a few.

GPS navigation. This is the most obvious application that comes to mind, and probably the most privacy-sensitive: there have been many controversies around tracking of vehicle movements, such as NYC cab drivers threatening to strike. The privacy goal is to keep the location trail of the user/vehicle unknown even to the service provider — unlike in the context of social networks, people often don’t even trust the service provider. There are several papers on anonymizing GPS-related queries, but there doesn’t seem to be much you can do to hide the origin and destination except via charmingly unrealistic cryptographic protocols.

The accuracy of GPS is a few tens or few hundreds of feet, which is the same order of magnitude as a city block. So your daily commute is pretty much unique. If you took a (GPS-enabled) cab home from work at a certain time, there’s a good chance the trip can be tied to you. If you made a detour to stop somewhere, the location of your stop can probably be determined. This is true even if there is no record tying you to a specific vehicle.

Location-based social networking. Pretty soon, every smartphone will be capable of running applications that transmit location data to web services. Google Latitude and Loopt are two of the major players in this space, providing some very nifty social networking functionality on top of location awareness. It is quite tempting for service providers to outsource research/data-mining by sharing de-identified data. I don’t know if anything of the sort is being done yet, but I think it is clear that de-identification would offer very little privacy protection in this context. If a pair of locations is uniquely identifying, a trail is emphatically so.

The same threat also applies to data being subpoena’d, so data retention policies need to take into consideration the uselessness of anonymizing location data.

I don’t know if cellular carriers themselves collect a location trail from phones as a matter of course. Any idea?

Plain old web browsing. Every website worth the name identifies you with a cookie, whether you log in or not. So if you browse the web from a laptop or mobile phone from both home and work, your home and work IP addresses can be tied together based on the cookie. There are a number of free or paid databases for turning IP addresses into geographical locations. These are generally accurate up to the city level, but beyond that the accuracy is shaky.

A more accurate location fix can be obtained by IDing WiFi access points. This is a curious technological marvel that is not widely known. Skyhook, Inc. has spent years wardriving the country (and abroad) to map out the MAC addresses of wireless routers. Given the MAC address of an access point, their database can tell you where it is located. There are browser add-ons that query Skyhook’s database and determine the user’s current location. Note that you don’t have to be browsing wirelessly — all you need is at least one WiFi access point within range. This information can then be transmitted to websites which can provide location-based functionality; Opera, in particular, has teamed up with Skyhook and is “looking forward to a future where geolocation data is an assumed part of the browsing experience.” The protocol by which the browser communicates geolocation to the website is being standardized by the W3C.

The good news from the privacy standpoint is that the accurate geolocation technologies like the Skyhook plug-in (and a competing offering that is part of Google Gears) require user consent. However, I anticipate that once the plug-ins become common, websites will entice users to enable access by (correctly) pointing out that their location can only be determined to within a few hundred meters, and users will leave themselves vulnerable to inference attacks that make use of location pairs rather than individual locations.

Image metadata. An increasing number of cameras these days have (GPS-based) geotagging built-in and enabled by default. Even more awesome is the Eye-Fi card, which automatically uploads pictures you snap to Flickr (or any of dozens of other image sharing websites you can pick from) by connecting to available WiFi access points nearby. Some versions of the card do automatic geotagging in addition.

If you regularly post pseudonymously to (say) Flickr, then the geolocations of your pictures will probably reveal prominent clusters around the places you frequent, including your home and work. This can be combined with auxiliary data to tie the pictures to your identity.
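
As a rough illustration of how little work this takes, the sketch below picks out the dominant location clusters from a set of geotags. The coordinates are invented, extracting them from EXIF metadata (with a library such as exifread) is assumed to have happened already, and rounding to a coarse grid is a crude stand-in for proper clustering:

```python
from collections import Counter

def frequent_places(geotags, precision=3, top=2):
    """Bucket coordinates onto a coarse grid and return the busiest cells."""
    cells = Counter((round(lat, precision), round(lon, precision))
                    for lat, lon in geotags)
    return cells.most_common(top)

# Hypothetical geotags for one account: many photos near two locations
# (plausibly home and work), plus a couple of vacation shots.
geotags = ([(40.3573, -74.6672)] * 30 +
           [(40.3500, -74.6520)] * 15 +
           [(48.8584, 2.2945), (51.5007, -0.1246)])

print(frequent_places(geotags))
# The two dominant clusters are strong candidates for home and work.
```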

Now let us turn to the other major question: what are the sources of auxiliary data that might link location pairs to identities? The easiest approach is probably to buy data from Acxiom, or another provider of direct-marketing address lists. Knowing approximate home and work locations, all that the attacker needs to do is to obtain data corresponding to both neighborhoods and do a “join,” i.e., find the (hopefully) unique common individual. This should be easy with Acxiom, which lets you filter the list by “DMA code, census tract, state, MSA code, congressional district, census block group, county, ZIP code, ZIP range, radius, multi-location radius, carrier route, CBSA (whatever that is), area code, and phone prefix.”

Google and Facebook also know my home and work addresses, because I gave them that information. I expect that other major social networking sites also have such information on tens of millions of users. When one of these sites is the adversary — such as when you’re trying to browse anonymously — the adversary already has access to the auxiliary data. Google’s power in this context is amplified by the fact that they own DoubleClick, which lets them tie together your browsing activity on any number of different websites that are tracked by DoubleClick cookies.

Finally, while I’ve talked about image data being the target of de-anonymization, it may equally well be used as the auxiliary information that links a location pair to an identity — a non-anonymous Flickr account with sufficiently many geotagged photos probably reveals an identifiable user’s home and work locations. (Some attack techniques that I describe on this blog, such as crawling image metadata from Flickr to reveal people’s home and work locations, are computationally expensive to carry out on a large scale but not algorithmically hard; such attacks, as can be expected, will rapidly become more feasible with time.)

Summary. A number of devices in our daily lives transmit our physical location to service providers whom we don’t necessarily trust, and who might keep this data around or transmit it to third parties we don’t know about. The average user simply doesn’t have the patience to analyze and understand the privacy implications, making anonymity a misleadingly simple way to assuage their concerns. Unfortunately, anonymity breaks down very quickly when more than one location is associated with a person, as is usually the case.

May 13, 2009 at 6:42 am 24 comments

Is Anonymity Research Ethical?

A researcher who is working on writing style analysis (“stylometry”), after reading my post on related de-anonymization techniques, wonders what the positive impact of such research could be, given my statement that the malicious uses of the technology are far greater than the beneficial ones. He says:

Sometimes when I’m thinking of an interesting research topic it’s hard to forget the Patton Oswalt line “Hey, we made cancer airborne and contagious! You’re welcome! We’re science: we’re all about coulda, not shoulda.”

This was my answer:

To me, generic research on algorithms always has a positive impact (if you’re breaking a specific website or system, that’s a different story; a bioweapon is a whole different category.) I do not recognize a moral question here, and therefore it does not affect what I choose to work on.

My belief that the research will have a positive impact is not at odds with my belief that the uses of the technology are predominantly evil.  In fact, the two are positively correlated. If we’re talking about web search technology, if academics don’t invent it, then (benevolent) companies will. But if we’re talking about de-anonymization technology, if we don’t do it, then malevolent entities will invent it (if they haven’t already), and of course, keep it to themselves. It comes down to a choice between a world where everyone has access to de-anonymization techniques, and hopefully defenses against it, versus one in which only the bad guys do. I think it’s pretty clear which world most people will choose to live in.

I realize I lean toward the “coulda” side of the question of whether Science is—or should be—amoral. Someone like Prof. Benjamin Kuipers here at UT seems to be close to the other end of the spectrum: he won’t take any DARPA money.

Part of the problem with allowing morality to affect the direction of science is that it is often arbitrary. The Patton Oswalt quote above is a perfect example: he apparently said that in response to news of science enabling a 63 year old woman to give birth. The notion that something is wrong simply because it is not “natural” is one that I find most repugnant. If the freedom of a 63 year old woman to give birth is not an important issue to you, let me note that more serious issues such as stem cell research, that could save lives, fall under the same category.

Going back to anonymity, it is interesting that tools like Tor face much criticism, but for enabling the anonymity of “bad” people rather than breaking the anonymity of “good” people. Who is to be the arbiter of the line between good and bad? I share the opinion of most techies that Tor is a wonderful thing for the world to have.

There are many sides to this issue and many possible views. I’d love to hear your thoughts.

April 9, 2009 at 8:42 pm 8 comments

De-anonymizing Social Networks

Our social networks paper is finally officially out! It will be appearing at this year’s IEEE S&P (Oakland).

Download: PDF | PS | HTML

Please read the FAQ about the paper.

Abstract:

Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc.

We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.

Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.

The HTML version was produced using  my Project Luther software, which in my opinion produces much prettier output than anything else (especially math formulas). Another big benefit is the handling of citations: it automatically searches various bibliographic databases and adds abstract/bibtex/download links and even finds and adds links to author homepages in the bib entries.

I have never formally announced or released Luther; it needs more work before it can be generally usable, and my time is limited. Drop me a line if you’re interested in using it.

March 19, 2009 at 11:09 am 18 comments

Anonymous Data Collection: Lessons from the A-Rod Affair

Recently, the Alex Rodriguez steroid controversy has been in the news. The aspect that interests me is the manner in which it came to attention: A-Rod provided a urine sample as part of a supposedly anonymous survey of Major League Baseball players in 2003, the goal of which was to determine if more than 5% of players were using banned substances. When Federal agents came calling, the sample turned out to be not so anonymous after all.

The failure of anonymity here was total–the testing lab simply failed to destroy the samples or even take the labels off them, and the Players’ Union, which conducted the survey, failed to call the lab and ask them to do so during the more than one-week window that they had before the subpoena was issued.

However, there are a number of ways in which things could have gone wrong even if one or more of the parties had followed proper procedure. None of the scenarios below result in as straightforward an association between player and steroid use as we have seen. On the other hand, they can be just as damaging in the court of public opinion.

  • If the samples were not destroyed, but simply de-identified, DNA can be recovered even after years, and the DNA can be used to match the player to the sample. You might argue the feds can’t easily get hold of players’ DNA to run such a matching, but once the association between drug test result and DNA has been made, it is a sword of Damocles hanging over the player’s head (note that A-Rod’s drug test happened six years ago.) The trend in recent years has been toward increased DNA profiling and bigger and bigger databases, and unlabeled samples therefore pose a clear danger.
  • If the samples are destroyed, and the test results are stored in de-identified form, anonymity could still be compromised. A drug test measures the concentrations of a bunch of different chemicals in the urine. It is likely that this results in a “profile” that is characteristic of a person–just like a variety of other biometric characteristics. If the same player, having stopped the use of banned substances, provides another urine sample, it is possible that this profile can be matched to the old one based on the fact that most of the urine chemicals have not changed in concentration. It is an interesting research question to see how stable the “profiles” are, and what their discriminatory power is.
  • Even more sophisticated attacks are possible. Let’s say that participant names are known, but other than that the only thing that’s released is a single statistic: the percentage of players that tested positive. Now, if the survey is performed on a regular basis, and a certain player (who happens to use steroids) participates only some of the time, the overall statistic is going to be slightly higher whenever that player participates. In spite of confounding factors, such as the fact that other players might also drop in and out, statistical techniques can be used to tease out this correlation. 

    This might sound like a tall order at first, but it is a proven attack strategy. The technique was used recently in a PLoS Genetics paper to identify if an individual had contributed DNA to an aggregate sample of hundreds of individuals. 

    I performed a quick experiment, assuming that there are 1,000 players in the sample, of which 100 participate half the time (the rest participate all the time). 5% of the players dope, and each player either dopes throughout the study period or not at all. Testing is done every 3 months; the list of participants in each wave of the survey is known, as well as the percentage of players who tested positive in each wave. I found that after 3 years, there is enough information to identify 80% of the cheating players who participate irregularly. (Players who participate regularly are clearly safe.) A rough sketch of this kind of simulation appears after this list. 

    [Technical note: that's an equal error rate of 20%; i.e., 20% of the cheating players are not accused, and 20% of the accused are innocent. There is a trade-off between the two numbers, as always; if a higher accuracy is required, say only 10% of accused players are innocent, then 65% of the cheating players can be identified.]

  • When applicable, a combination of the above techniques such as matching de-identified profiles across different time-periods of a survey (or different surveys) can greatly increase the attacker’s potential.
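
Here is the rough simulation sketch referred to above, using the parameters of the third scenario. The modeling choices below are deliberately simple, so the detection rates it produces will not exactly match the figures quoted above; it is meant only to illustrate the principle:

```python
import random
import statistics

random.seed(1)
N_PLAYERS, N_IRREGULAR, DOPING_RATE, N_WAVES = 1000, 100, 0.05, 12  # quarterly, 3 years

dopers = set(random.sample(range(N_PLAYERS), int(N_PLAYERS * DOPING_RATE)))
irregular = set(range(N_IRREGULAR))   # players 0..99 skip roughly half the waves

# Simulate the published per-wave statistic and record who participated.
participation, positive_rate = [], []
for _ in range(N_WAVES):
    present = {p for p in range(N_PLAYERS)
               if p not in irregular or random.random() < 0.5}
    participation.append(present)
    positive_rate.append(sum(1 for p in present if p in dopers) / len(present))

def score(player):
    """Mean published rate when the player was in, minus when they were out."""
    rates_in = [r for who, r in zip(participation, positive_rate) if player in who]
    rates_out = [r for who, r in zip(participation, positive_rate) if player not in who]
    if not rates_in or not rates_out:
        return 0.0   # players present in every wave carry no signal
    return statistics.mean(rates_in) - statistics.mean(rates_out)

# Accuse the irregular participants with the highest scores.
suspects = sorted(irregular, key=score, reverse=True)[:10]
hits = sum(1 for p in suspects if p in dopers)
print(f"irregular dopers: {len(dopers & irregular)}, among top-10 accused: {hits}")
```

A real analysis would use a proper test statistic and control for the other confounding factors mentioned above, but the correlation being exploited is the same.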

The point of the above scenarios is to convince you that you can never, ever be certain that the connection between a person and their data has been definitively severed. Regular readers of this blog will know that this is a recurring theme of my research. The quantity of data being collected today and the computational power available have destroyed the traditional and ingrained assumptions about anonymity. Well-established procedures have been shown to be completely inadequate, and it is far from clear that things can be fixed. Anyone who cares about their privacy must be vigilant against giving up their data under false promises of anonymity.

February 19, 2009 at 2:24 am Leave a comment

De-anonymizing the Internet

I’ve been thinking about this problem for quite a while: is it possible to de-anonymize text that is posted anonymously on the Internet by matching the writing style with other Web pages/posts where the authorship is known? I’ve discussed this with many privacy researchers but until recently never written anything down. When someone asked essentially the same question on Hacker News, I barfed up a stream of thought on the subject :-) Here it is, lightly edited.

Each one of us has a writing style that is idiosyncratic enough to have a unique “fingerprint”. However, it is an open question whether it can be efficiently extracted.

The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.
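
As a toy illustration of the marker idea (and only of that idea; real stylometric systems use hundreds of features such as function words, character n-grams and syntax), here is a sketch that fingerprints texts by the relative frequency of a few interchangeable word pairs and matches an anonymous sample to the nearest known author. The texts are made up:

```python
import re

PAIRS = [("since", "because"), ("but", "however"), ("big", "large")]

def fingerprint(text):
    """For each pair, the fraction of uses going to the first word (0.5 if unused)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = {w: words.count(w) for pair in PAIRS for w in pair}
    return [counts[a] / (counts[a] + counts[b]) if counts[a] + counts[b] else 0.5
            for a, b in PAIRS]

def distance(f1, f2):
    return sum(abs(x - y) for x, y in zip(f1, f2))

known_authors = {
    "author_A": fingerprint("Since the dataset is big, we sample it; "
                            "but since memory is big too, caching helps."),
    "author_B": fingerprint("Because the corpus is large, we sample it; "
                            "however, because memory is large, caching helps."),
}
anonymous_post = fingerprint("Since the index is big, lookups are slow, "
                             "but the cache hides most of the cost.")

best = min(known_authors, key=lambda a: distance(known_authors[a], anonymous_post))
print(best)   # matches author_A's preference for 'since', 'but', and 'big'
```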

The beginning of modern, rigorous research in this field was by Mosteller and Wallace in 1964: they identified the author of the disputed Federalist papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other “coups” for writing-style de-anonymization are the identification of the author of Primary Colors, as well as of the Unabomber (his brother recognized his style; it wasn’t done by statistical/computational means).

The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then. I’m familiar with the results from those papers; the curious thing is that they stop at corpuses of a couple hundred authors or so — i.e., identifying one anonymous poster out of say 200, rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need is always logarithmic in the potential number of authors, and so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.

The other interesting thing is that the papers are fixated with ‘topic-free’ identification, where the texts aren’t about a particular topic, making the problem harder. The good news is that when you’re doing this Internet-scale, nobody is stopping you from using topic information, making it a lot easier.

So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you’d need fairly long texts, perhaps a page or two. It’s doubtful that anything can be done with a single average-length email.

Another potential de-anonymization strategy is to use typing pattern fingerprinting (keystroke dynamics), i.e., analyzing the timing between our keystrokes (yes, this works even for non-touch typists). This is already used in commercial products as an additional factor in password authentication. However, the implications for de-anonymization have not been explored, and I think it’s very, very feasible. For instance, if Google were to insert Javascript into Gmail to fingerprint you when you were logged in, they could use the same Javascript to identify you on any web page where you type in text, even if you don’t identify yourself. Now think about the de-anonymization possibilities you can get by combining analysis of writing style and keystroke dynamics…
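
To make the keystroke-dynamics idea concrete, here is a minimal sketch of the matching step. The digraph latencies are invented, and a real system would build profiles from far more data (and would also use key hold times), but it shows the kind of comparison involved:

```python
from collections import defaultdict

def digraph_profile(keystrokes):
    """keystrokes: list of (key, timestamp_ms) in typing order."""
    latencies = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(keystrokes, keystrokes[1:]):
        latencies[k1 + k2].append(t2 - t1)
    return {dg: sum(v) / len(v) for dg, v in latencies.items()}

def profile_distance(p1, p2):
    shared = set(p1) & set(p2)
    if not shared:
        return float("inf")
    return sum(abs(p1[dg] - p2[dg]) for dg in shared) / len(shared)

# Hypothetical stored profiles (mean latency in ms for common digraphs),
# e.g. collected while each user was typing into a logged-in session.
profiles = {
    "alice": {"th": 95.0, "he": 110.0, "in": 130.0},
    "bob":   {"th": 180.0, "he": 150.0, "in": 90.0},
}

# An "anonymous" typing sample captured on some other page.
sample = digraph_profile([("t", 0), ("h", 98), ("e", 205), ("i", 400), ("n", 527)])

print(min(profiles, key=lambda user: profile_distance(profiles[user], sample)))
# -> alice
```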

By the way, make no mistake: the malicious uses of this far overwhelm the benevolent uses. Once this technology becomes available, it will be very hard to post anonymously at all. Think of the consequences for political dissent or whistleblowers. The great firewall of China could simply insert a piece of javascript into every web page, and poof, there goes the anonymity of everyone in China.

I think it’s likely that one can build a tool to protect anonymity by taking a chunk of writing and removing your fingerprint from it, but it will need a lot of work, and will probably lead to a cat-and-mouse game between improved de-anonymization and obfuscation techniques. Note the caveats, however: first, most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts — rants about employers, accounts from cheating spouses, political dissent, etc. — that have already been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.

If you’re interested in working on this problem–either writing style analysis for breaking anonymity or obfuscation techniques for protecting anonymity–drop me a line.

January 15, 2009 at 3:16 am 21 comments

The Fallacy of Anonymous Institutions

The graph below is from the paper “Chains of affection: The structure of adolescent romantic and sexual networks.” The name of the school that the data was collected from is not revealed, and is given the working name “Jefferson High.” It is part of the National Longitudinal Study of Adolescent Health, containing very detailed health information on 100,000 high school students in 140 schools. In 12 of the schools, the entire sexual network was mapped out.

Clearly, the authors felt that concealing the identity of the school is important for protecting the privacy of the participants. It’s not hard to see why: firstly, the aggregate information presented in the study could by itself be unpleasant, especially facts about the prevalence of adolescent sexual activity in a conservative rural town (see below). Second, and more importantly, knowing the identity of the school can lead to further de-anonymization of the individuals in the network.

The graph above is rich enough that a few individuals can identify themselves purely based on the local information available to them, and thus learn things about their neighbors in the graph. A group of individuals getting together will have an even easier time of it. Furthermore, the actual paper provides a richer, temporally ordered version of the graph above.

But even strangers may benefit: depending on how well the temporal information in the sexual graph correlates with other temporal information that may be available, say from Facebook, de-anonymization might be possible with little or no co-operation from the subjects themselves. Soon, I will have more to say about research results on de-anonymizing graphs with loosely correlated external/auxiliary data.

Having established the privacy risk, let’s see how easy it is to re-identify Jefferson High. The authors give us these helpful clues:

“Jefferson High School” is an almost all-white high school of roughly 1000 students located in a mid-sized mid-western town. Jefferson High is the only public high school in the town. The town, “Jefferson City” is over an hour away by car from the nearest large city. Jefferson City is surrounded by beautiful countryside, home to many agricultural enterprises. The town itself is working class, although there remain some vestiges of better times. At one period, the town served as a resort for city dwellers, drawing an annual influx of summer visitors. This is no longer the case, and many of the old resort properties show signs of decay. The community is densely settled. At the time of our fieldwork, students were reacting to the deaths of two girls killed in an automobile accident.

Some further facts presented have high amusement value, and are equally useful for re-identification:

Jefferson students earn lower grades, are suspended more, feel less attached to school, and come from poorer families than those at comparable schools. They are more likely than students in other high schools to have trouble paying attention, have lower self-esteem, pray more, have fewer expectations about college, and are more likely to have a permanent tattoo. Compared to other students in large disproportionately white schools, adolescents in Jefferson High are more likely to drink until they are drunk. In schools of comparable race and size, on average 30% of 10th-12th grade students smoke cigarettes regularly, whereas in Jefferson, 36% of all 10th to 12th graders smoke. Drug use is moderate, comparable to national norms. Somewhat more than half of all students report having had sex, a rate comparable to the national average, and only slightly higher than observed for schools similar with respect to race and size. Nevertheless, if Jefferson is not Middletown, it looks an awful lot like it. The adolescents at Jefferson High are pretty normal. In describing the events of the past year, many students report that there is absolutely nothing to do in Jefferson. For fun, students like to drive to the outskirts of town and get drunk. Jefferson is a close-knit insular predominantly working-class community which offers few activities for its youth.

A database of public schools in the U.S. is available for sale for $75, containing very detailed information about each school. I’m quite confident that the information in there is sufficient to re-identify Jefferson High.

The thesis of this blog is that the amount of entropy required to de-anonymize an individual — 33 bits — is low enough that anonymity doesn’t offer meaningful protection in most circumstances. Obviously, the argument applies even more strongly to the anonymity of a well-defined group of people.
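
For readers new to the blog, the arithmetic behind the title fits in a few lines. The figure of roughly 25,000 U.S. public high schools and the fractions assigned to each clue below are rough guesses on my part, purely to show how quickly a handful of attributes adds up:

```python
import math

# ~33 bits of information pin down one person on the planet.
print(round(math.log2(6.7e9), 1))   # ~32.6

def bits(fraction_matching):
    """Identifying information contributed by one attribute."""
    return -math.log2(fraction_matching)

# Re-identifying "Jefferson High" among roughly 25,000 U.S. public high schools
# takes about log2(25,000) ~ 14.6 bits. The fractions below are rough guesses.
clues = {
    "only public high school in its town":            0.3,
    "enrollment around 1000, almost all-white":       0.05,
    "mid-western, over an hour from a large city":    0.05,
    "two students recently killed in a car accident": 0.001,
}
needed = math.log2(25_000)
supplied = sum(bits(f) for f in clues.values())
print(f"need ~{needed:.1f} bits; clues supply ~{supplied:.1f} (treating them as independent)")
```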

Let’s be clear: the paper is from 1994; who slept with whom in high school is not a huge deal a decade and a half later. However, the problem is systemic, and IRBs (Institutional Review Boards) keep blithely approving releases of data with such nominal de-identification applied. The re-identification of the institutional affiliation of an entire population of a study is of more concern from the privacy perspective than the de-anonymization of individual identities: it needs to be done only once, and affects hundreds or thousands of individuals.

Recently, a group of researchers from the Berkman Center released a dataset of Facebook profile information from an entire cohort (the class of 2009) of college students from “an anonymous, northeastern American university.” It was promptly de-anonymized by Michael Zimmer, who revealed that it was Harvard College:

As I noted here, the press release and the public codebook for the dataset provided many clues to where the data came from: we know it is a northeastern US university, it is private, co-ed, and whose class of 2009 initially had 1640 students in it. A quick search for schools reveals there are only 7 private, co-ed colleges in New England states (CT, ME, MA, NH, RI, VT) with total undergraduate populations between 5000 and 7500 students (a likely range if there were 1640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College.

[...]

Finally, and perhaps most convincingly, only Harvard College offers the specific variety of the subjects’ majors that are listed in the codebook. While nearly all universities offer the common majors of “History”, “Chemistry” or “Economics”, one only needs to search for the more uniquely phrased majors to discover a shared home institution.

Another amusing example is a paper on mobile phone call graphs which attempts to keep the identity of an entire country secret. I found that the approximate population of the country reported in the paper together with the mobile phone penetration rate is sufficient to uniquely identify it.

Suppressing the identity of your study population has some privacy benefits: at least, it won’t show up in google searches. But relying on it for any kind of serious privacy protection would be foolish. Scrubbing an entire dataset or research paper of clues about the study population can be hard or impossible; further, a single study participant corroborating the published results or methodology might be sufficient for de-anonymization of the group. The only solution is therefore to assume that the identity of the study population will be discovered, and to try to ensure that individual identities will still be safe from re-identification.

December 15, 2008 at 10:48 am 3 comments



About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
