Posts Tagged anonymity

Ubercookies Part 2: History Stealing meets the Social Web

Recap. In the previous article I introduced ubercookies — techniques that websites can use to de-anonymize visitors. I discussed a recent paper that shows how to use history stealing along with social network group membership information to find the visitor’s identity, and I promised a stronger variant of the attack.

The observation that led me to the attack I’m going to describe is simple: social networking isn’t just about social networks — the whole web has gone social. It’s a view that you quickly internalize if you spend any time hanging out with Silicon Valley web entrepreneurs :-)

Let’s break the underlying principle of the identity-stealing attack down to its essence:

A user leaves a footprint whenever their interaction with a specific web page is recorded publicly.

De-anonymization happens when the attacker can tie these footprints together into “trails” that can then be correlated with the user’s browser history. Efficiently querying the history to identify multiple points on the trail is a challenging problem to solve, but in principle de-anonymization is possible as long as the user’s actions on different web pages happen under the same identity.

Footprints can be tied together into trails as long as all the interactions happen under the same identity. There is no need for the interactions to be on the same website.

There are two major ways in which you can interact with arbitrary websites under a unified identity, both of which are defining principles of the social web. The first is federated identity, which means you can use the same identity provider wherever you go. This is achieved through OpenID and similar mechanisms like Facebook Connect. The second is social sharing: whenever you find something interesting anywhere on the web, you feed it back to your social network.

Now let’s examine the different types of interactions in more detail.

A taxonomy of interaction on the social web.

0. The pre-social web had no social networks and no delegated identity mechanism (except for the failed attempt by Microsoft called Passport). Users created new identities on each website, authenticated via site-specific usernames and passwords to each site separately. The footprints on different sites cannot be tied together; for practical purposes there are no footprints.

1. Social networks: affiliation. In social networks, users interact with social objects and leave footprints when the actions are public. The key type of interaction that is useful for de-anonymization is the expression of affiliation: this refers to not just the group memberships studied in the recent Wondracek et al. paper, but also includes

  • memberships of fan pages on Facebook
  • “interests” on Livejournal
  • follow relationships and plain old friend relationships on Twitter and other public social networks
  • subscriptions to Youtube channels

and so on.

All of these interactions, albeit very different from the user perspective, are fundamentally the same concept:

  • you “add yourself” to or affiliate yourself with some object on a social network
  • this action can be publicly observed
  • you almost certainly visited a URL that identifies the object before adding it.
As you can imagine, these actions leave a trail.

2. The social web: sharing. When you find a page you like — any page at all — you can import it or “share” it to your social stream, on Facebook, Twitter, Google Buzz, or a social bookmarking site like Delicious. The URL of the page is almost certainly in your history, and as long as your social stream is public, your interaction was recorded publicly.


3. The social web: federated identity. When you’re reading a blog post or article on the social web, you can typically comment on it, “like” it, favorite it, rate it, etc. You do all this under your Facebook, Google or other unified identity. These actions are often public and when they are, your footprint is left on the page.

A taxonomy of attacks

The three types of social interactions above give rise to a neat taxonomy of attacks. They involve progressively easier backend processing and progressively more sophisticated history search techniques on the front end. But the execution time on the front-end doesn’t increase, so it is a net win. Here’s a table:

Type of interaction
Backend processing
Type of history URL
Location of footprint
Affiliation Crawling of social network Object in a social network In the social network
Sharing Syndication of social stream(s) from social network Any page In the social network
Federated identity None; optional crawling Any page On the page

.

1. Better use of affiliation information. The Wondracek et al. paper makes use of only group membership. One natural reason to choose groups is that there are many groups that are large, with thousands of members, so it gives us a reasonably high chance that by throwing darts in the browser history we will actually hit a few groups that the user has visited. On the other hand, if we try to use the Facebook friend list, for example, hoping to find one of the user’s friends by random chance, it probably won’t work because most users have only a few hundred friends.

But wait: many Twitter users have thousands or even millions of followers. These are known as “hubs” in network theory. Clearly, the attack will work for any kind of hubs that have predictable URLs, and users on Twitter have even more predictable URLs (twitter.com/username) than groups on various networks. The attack will also work using Youtube favorites (which show up by default on the user’s public profile or channel page) and whatever other types of affiliation we might choose to exploit, as long as there are “hubs” — nodes in the graph with high degree. Already we can see that many more websites are vulnerable than the authors envisaged.

2. Syndicating the social stream: my Delicious experiment.

The interesting thing about the social stream is that you can syndicate the stream of interactions, rather than crawling. The reasons why syndication is much easier than crawling are more practical than theoretical. First, syndicated data is intended to be machine readable, and is therefore smaller as well as easier to parse compared to scraping web pages. Second, and more importantly, you might be to get a feed of the entire site-wide activity instead of syndicating each user’s activity stream separately. Delicious allows global syndication; Twitter plans to open this “firehose” feature to all developers soon.

Another advantage of the social stream is that everything is timestamped, so you can limit yourself to recent interactions, which are more likely to be in the user’s history.

Using the delicious.com dataset made available by DAI-labor (a log of all bookmarking activity on delicious.com over several years), I did a simulated experiment using 3 months worth of data: assuming that users keep their history around for 3 months, do in fact visit every link they post on delicious, how many users would a hypothetical history stealing attack be able to identify? I had a pretty good success rate: about 60% of the users who had shared at least 2 links in the 3-month period, or about 300,000 users. This takes at most 4000-5000 Javascript history queries.

Needless to say, once Twitter opens up its firehose, Twitter users (who are far more numerous than delicious users) would also be susceptible to the same technique.

This attack is not possible to fix via server-side URL randomization. It can also be made to work using Facebook, Google Buzz, and other sharing platforms, although the backend processing required won’t be as trivial (but probably no harder than in the original attack.)

3. A somewhat random walk through the history park.

And now for an approach that potentially requires no backend data collection, although it is speculative and I can’t guess what the success rate would be. The attack proceeds in several steps:

  1. Identify the user’s interests by testing if they’ve visited various popular topic-specific sites. Pick one of the user’s favorite topics. Incidentally, a commenter on my previous post notes he is building exactly this capability using topic pages on Wikipedia, also with the goal of de-anonymization!
  2. Grab a list of the top blogs on the topic you picked from one of the blog directories. Query the history to see which of these blogs the user reads frequently. It is even possible to estimate the level of interest in a blogs by looking at the fraction of the top/recent posts from that blog that the user has visited. Pick a blog that the user seems to visit regularly.
  3. Look for evidence of the user leaving comments on posts. For example, on Blogger, the comment page for a post has the URL http://www.blogger.com/comment.g?blogID=<blogid>&postID=<postid>.
  4. Once you find a couple of posts where it looks like the user made a comment, scrape the list of people who commented on it, find the intersection. (Even a single comment might suffice; as long as you have a list of candidates, you easily verify if it’s one of them by testing user-specific URLs. More below.)
  5. Depending on the blogging platform, you might even be able to deduce that the user responded (or intended to respond) to a specific comment. For example, On wordpress you have the pattern http://<blogname>.wordpress.com/<postname>/?replytocom=<commentid>#respond. If you get lucky and find one of those patterns, that makes things even easier.

If at first you don’t succeed, pick a different blog and repeat.

I suspect that the most practical method would be to use a syndicated activity stream from a social network, but also to use the heuristics presented above to more efficiently search through the history.

Epilogue: Identity.

Not only has there been a movement towards a small number of identity providers on the web, there are many aggregators out there that have sprung up in order to automatically find the connections between identities across the different identity providers, and also connect online identities to physical-world databases. As Pete Warden notes:

One of the least-understood developments of the last few years is the growth of databases of personal information linked to email addresses. Rapleaf is probably the leader in this field, but even Flickr lets companies search their API for users based on an email address.

I ran my email address through his demo script and it is quite clear that virtually all of my online identities have been linked together. This is getting to be the norm; as a consequence, once an attacker gets any kind of handle on you, they can go “identity hopping” and find out a whole lot more about you.

This is also the reason that once the attacker can make a reasonable guess at the visitor’s identity, it’s easy to verify the guess. Not only can they look for user-specific URLs in your history to confirm the guess (described in detail in the Wondracek et al. paper), but all your social streams on other sites can also be combined with your history to corroborate your identity.

Up next in the Ubercookies series: So that’s pretty bad. But it’s going to get worse before it can get better :-) In the next article, I will describe an entirely different attack strategy to get at your identity by exploiting a bug in a specific identity provider’s platform.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

3 comments February 19, 2010

Cookies, Supercookies and Ubercookies: Stealing the Identity of Web Visitors

Synopsis. Highly sticky techniques called supercookies for tracking web visitors are becoming well known. But the state of the art has in fact advanced beyond tracking, allowing a website to learn the identity of a visitor. I will call these techniques ubercookies; this article describes one such recently discovered technique. Future articles in this series will describe even more powerful variants and discuss the implications.

Cookies. Most people are aware that their web browsing activity over time and across sites can be tracked using cookies. When you are being tracked, it can be deduced that the same person visited certain sites at certain times, but the sites doing the tracking don’t know who you are, i.e., you name, etc., unless you choose to tell them in some way, such as by logging in.

Cookies are easy to delete, and so there’s been a big impetus in the Internet advertising industry to discover and deploy more robust tracking mechanisms.

Supercookies. You may surprised to find just how helpless a user is against a site (or more usually, a network of sites) that is truly determined to track them. There are Flash cookies, much harder to delete, some of which respawn the regular HTTP cookies that you delete. The EFF’s Panopticlick project demonstrates many “browser fingerprinting” methods which are more sophisticated. (Jonathan Mayer’s senior thesis contained a smaller-scale demonstration of some of those techniques).

A major underlying reason for a lot of these problems is that any browser feature that allows a website to store “state” on the client can be abused for tracking, and there are a bewildering variety of these. There is a great analysis in a paper by my Stanford colleagues. One of the points they make is that co-operative tracking by websites is essentially impossible to defend against.

Ubercookies: history stealing. Now let’s get to the scary stuff: uncovering identity. History stealing or history sniffing is an unintended consequence of the way the web is designed; it allows a website to learn which URLs you’ve been to. While a site can’t simply ask your browser for a list of visited URLs, it can ask “yes/no” questions and your browser will faithfully respond. The most common way of doing this is by injecting invisible links into the page using Javascript and exploiting the fact that the CSS link color attribute depends on whether the link has been visited or not.

History stealing has been known for a decade, and browser vendors have failed to fix it because it cannot be fixed without sacrificing some useful functionality (the crude way is to turn off visited link coloring altogether; a subtler solution is SafeHistory). Increasingly worse consequences have been discovered over the years: for example, a malicious site can learn which bank you use and customize a phishing page accordingly. But a paper (full text, PDF) coming out at this year’s IEEE S&P conference at Oakland takes it to a new level.

Identity. Let’s pause for a second and think about what finding your identity means. In the modern, social web, social network accounts have become our de-facto online identities, and most people reveal their name and at least some other real-world information about ourselves on our profiles. So if the attacker can discover the URL of your social network profile, we can agree that he has identified you for all practical purposes. And the new paper shows how to do just that.

The attack relies on the following observations:

  1. Almost all social networking sites have some kind of “group” functionality: users can add themselves to groups.
  2. Users typically add themselves to multiple groups, at least some of which are public.
  3. Group affiliations, just like your movie-watching history and many other types of attributes, are sufficient to fingerprint a user. There’s a high chance there’s no one else who belongs to the same set of groups that you do (or is even close). [Aside: I used this fact to show that Lending Club data can be de-anonymized.]
  4. Users who belong to a group are likely to visit group-specific URLs that are predictable.

Put the above facts together, and the attack emerges: the attacker (an arbitrary website you visit, without the co-operation of whichever social network is used as an attack enabler) uses history stealing to test a bunch of group-related URLs one by one until he finds a few (public) groups that the anonymous user probably belongs to. The attacker has already crawled the social network, and therefore knows which user belongs to which groups. Now he puts two and two together: using the list of groups he got from the browser, he does a search on the backend to find the (usually unique) user who belongs to all those groups.

Needless to say, this is a somewhat simplified description. The algorithm can be easily modified so that it will work even if some of the groups have disappeared from your history (say because you clear it once in a while) or if you’ve visited groups you’re not a member of. The authors demonstrated that the attack with real users on the Xing network, and also showed theoretically that it is feasible on a number of other social networks including Facebook and Myspace. It takes a few thousand Javascript queries and runs in a few seconds on modern browsers, which makes it pretty much surreptitious.

Fallout. There are only two ways to try to fix this. The first for all the social networking sites to change their URL patterns by randomizing them so that point 4 above (predictable URL identifying that you belong to a group) is no longer true. The second is for all the browser vendors to fix their browsers so that history stealing is no longer possible.

The authors contacted several of the social networks; Xing quickly implemented the URL randomization fix, which I find surprising and impressive. Ultimately, however, Xing’s move will probably be no more than a nice gesture, for the following reason.

Over the last few days, I have been working on a stronger version of this attack which:

  • can make use of every URL in the browser history to try and identify the user. This means that server-side fixes are not possible, because literally every site on the web would need to implement randomization.
  • avoids the costly crawling step, further lowering the bar to executing the attack.

That leaves browser-based fixes for history stealing, which hasn’t happened in the 10 years that the problem has been known. Will browsers vendors finally accept the functionality hit and deal with the problem? We can hope so, but it remains to be seen.

In the next article, I will describe the stronger attack and also explain in more detail why your profile page on almost any website is a very strong identifier.

Thanks to Adam Bossy for reviewing a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

18 comments February 18, 2010

The Secret Life of Data

Some people claim that re-identification attacks don’t matter, the reasoning being: “I’m not important enough for anyone to want to invest time on learning private facts about me.” At first sight that seems like a reasonable argument, at least in the context of the re-identification algorithms I have worked on, which require considerable human and machine effort to implement.

The argument is nonetheless fallacious, because re-identification typically doesn’t happen at the level of the individual. Rather, the investment of effort yields results over the entire database of millions of people (hence the emphasis on “large-scale” or “en masse”.) On the other hand, the harm that occurs from re-identification affects individuals. This asymmetry exists because the party interested in re-identifying you and the party carrying out the re-identification are not the same.

In today’s world, the entities most interested in acquiring and de-anonymizing large databases might be data aggregation companies like ChoicePoint that sell intelligence on individuals, whereas the party interested in using the re-identified information about you would be their clients/customers: law enforcement, an employer, an insurance company, or even a former friend out to slander you.

Data passes through multiple companies or entities before reaching its destination, making it hard to prove or even detect that it originated from a de-anonymized database. There are lots of companies known to sell “anonymized” customer data: for example Practice Fusion “subsidizes its free EMRs by selling de-identified data to insurance groups, clinical researchers and pharmaceutical companies.” On the other hand, companies carrying out data aggregation/de-anonymization are a lot more secretive about it.

Another piece of the puzzle is what happens when a company goes bankrupt. Decode genetics recently did, which is particularly interesting because they are sitting on a ton of genetic data. There are privacy assurances in place in their original Terms of Service with their customers, but will that bind the new owner of the assets? These are legal gray areas, and are frequently exploited by companies looking to acquire data.

At the recent FTC privacy roundtable, Scott Taylor of Hewlett Packard said his company regularly had the problem of not being able to determine where data is being shared downstream after the first point of contact. I’m sure the same is true of other companies as well. (How then could we possibly expect third-party oversight of this process?)  Since data fuels the modern Web economy, I suspect that the process of moving data around will continue to become more common as well as more complex, with more steps in the chain. We could use a good name for it — “data laundering,” perhaps?

1 comment February 6, 2010

The Internet has no Delete Button: Limits of the Legal System in Protecting Anonymity

It is futile to try to stay anonymous by getting your name or data purged from the Internet, once it is already out there. Attempts at such censorship have backfired repeatedly and spectacularly, giving rise to the term Streisand effect. A recent lawsuit provides the latest demonstration: two convicted German killers (who have completed their prison sentences) are attempting to prevent Wikipedia from identifying them.

The law in Germany tries to “protect the name and likenesses of private persons from unwanted publicity.” Of course, the Wikimedia foundation is based in the United States, and this attempt runs head-on into the First Amendment, the right to Free Speech. European countries have a variety of restrictions on speech—Holocaust denial is illegal, for instance. But there is little doubt about how U.S. courts will see the issue; Jennifer Granick of the EFF has a nice write-up.

The aspect that interests me is that even if there weren’t a Free Speech issue, it would be utterly impossible for the court system to keep the names of these men from the Internet. I wonder if the German judge who awarded a judgment against the Wikimedia foundation was aware that it would achieve exactly the “unwanted publicity” that the law was intended to avoid. He would probably have ruled as he did in any case, but it is interesting to speculate.

Legislators, on the other hand, would do well to be aware of the limitations of censorship, and the need to update laws to reflect the rules of the information age. There are always alternatives, although they usually involve trade-offs. In this instance, perhaps one option is a state-supplied alternate identity, analogous to the Witness Protection Program?

Returning to the issue of enforceability, the European doctrine apparently falls under “rights of the personality,” specifically the “right to be forgotten,” according to this paper that discusses the trans-atlantic clash. I find the very name rather absurd; it reminds me of attempting not to think of an elephant (try it!)

The above paper, written from the European perspective, laments the irreconcilable differences between the two viewpoints on the issue of Free Speech vs. Privacy. However, there is no discussion of enforceability. The author does suspect, in the final paragraph, that the European doctrine will become rather meaningless due to the Internet, but he believes this to be purely a consequence of the fact that the U.S. courts have put Free Speech first.

I don’t buy it—even if the U.S. courts joined Europe in recognizing a “right to be forgotten,” it would still be essentially unenforceable. Copyright-based rather than privacy-based censorship attempts offer us a lesson here. Copyright law has international scope, due to being standardized by the WIPO, and yet the attempt to take down the AACS encryption key was pitifully unsuccessful.

Taking down a repeat offender (such as a torrent tracker) or a large file (the Windows 2000 source code leak) might be easier. But if we’re talking about a small piece of data, the only factor that seems to matter is the level of public interest in the sensitive information. The only times when censorship of individual facts has been (somewhat) successful in the face of public sentiment is within oppressive regimes with centralized Internet filters.

There are many laws, particularly privacy laws, that need to be revamped for the digital age. What might appear obvious to technologists might be much less apparent to law scholars, lawmakers and the courts. I’ve said it before on this blog, but it bears repeating: there is an acute need for greater interdisciplinary collaboration between technology and the law.

Add comment November 28, 2009

Oklahoma Abortion Law: Bloggers get it Wrong

The State of Oklahoma just passed legislation requiring that detailed information about every abortion performed in the state be submitted to the State Department of Health. Reports based on this data are to be made publicly available. The controversy around the law gained steam rapidly after bloggers revealed that even though names and addresses of mothers obtaining abortions were not collected, the women could nevertheless be re-identified from the published data based on a variety of other required attributes such as the date of abortion, age and race, county, etc.

As a computer scientist studying re-identification, this was brought to my attention. I was as indignant on hearing about it as the next smug Californian, and I promptly wrote up a blog post analyzing the serious risk of re-identification based on the answers to the 37 questions that each mother must anonymously report. Just before posting it, however, I decided to give the text of the law a more careful reading, and realized that the bloggers have been misinterpreting the law all along.

While it is true that the law requires submitting a detailed form to the Department of Health, the only information that is made public are annual reports with statistical tallies of the number of abortions performed under very broad categories, which presents a negligible to non-existent re-identification risk.

I’m not defending the law; that is outside my sphere of competence. There do appear to be other serious problems with it, outlined in a lawsuit aimed at stopping the law from going into effect. The text of this complaint, as Paul Ohm notes, does not raise the “public posting” claim. Besides, the wording of the law is very ambiguous, and I can certainly see why it might have been misinterpreted.

But I do want to lament the fact that bloggers and special interest groups can start a controversy based on a careless (or less often, deliberate) misunderstanding, and have it amplified by an emerging category of news outlets like the Huffington post, which have the credibility of blogs but a readership approaching traditional media. At this point the outrage becomes self-sustaining, and the factual inaccuracies become impossible to combat. I’m reminded of the affair of the gay sheep.

10 comments October 9, 2009

Your Morning Commute is Unique: On the Anonymity of Home/Work Location Pairs

Philippe Golle and Kurt Partridge of PARC have a cute paper (pdf) on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations — to a block level — identifies them uniquely.

Even if we look at the much coarser granularity of a census tract — tracts correspond roughly to ZIP codes; there are on average 1,500 people per census tract — for the average person, there are only around 20 other people who share the same home and work location. There’s more: 5% of people are uniquely identified by their home and work locations even if it is known only at the census tract level. One reason for this is that people who live and work in very different areas (say, different counties) are much more easily identifiable, as one might expect.

The paper is timely, because Location Based Services  are proliferating rapidly. To understand the privacy threats, we need to ask the two usual questions:

  1. who has access to anonymized location data?
  2. how can they get access to auxiliary data linking people to location pairs, which they can then use to carry out re-identification?

The authors don’t say much about these questions, but that’s probably because there are too many possibilities to list! In this post I will examine a few.

GPS navigation. This is the most obvious application that comes to mind, and probably the most privacy-sensitive: there have been many controversies around tracking of vehicle movements, such as NYC cab drivers threatening to strike. The privacy goal is to keep the location trail of the user/vehicle unknown even to the service provider — unlike in the context of social networks, people often don’t even trust the service provider. There are several papers on anonymizing GPS-related queries, but there doesn’t seem to be much you can do to hide the origin and destination except via charmingly unrealistic cryptographic protocols.

The accuracy of GPS is a few tens or few hundreds of feet, which is the same order of magnitude as a city block. So your daily commute is pretty much unique. If you took a (GPS-enabled) cab home from work at a certain time, there’s a good chance the trip can be tied to you. If you made a detour to stop somewhere, the location of your stop can probably be determined. This is true even if there is no record tying you to a specific vehicle.

ScreenshotLocation based social networking. Pretty soon, every smartphone will be capable of running applications that transmit location data to web services. Google Latitude and Loopt are two of the major players in this space, providing some very nifty social networking functionality on top of location awareness. It is quite tempting for service providers to outsource research/data-mining by sharing de-identified data. I don’t know if anything of the sort is being done yet, but I think it is clear that de-identification would offer very little privacy protection in this context. If a pair of locations is uniquely identifying, a trail is emphatically so.

The same threat also applies to data being subpoena’d, so data retention policies need to take into consideration the uselessness of anonymizing location data.

I don’t know if cellular carriers themselves collect a location trail from phones as a matter of course. Any idea?

Plain old web browsing. Every website worth the name identifies you with a cookie, whether you log in or not. So if you browse the web from a laptop or mobile phone from both home and work, your home and work IP addresses can be tied together based on the cookie. There are a number of free or paid databases for turning IP addresses into geographical locations. These are generally accurate up to the city level, but beyond that the accuracy is shaky.

A more accurate location fix can be obtained by IDing WiFi access points. This is a curious technological marvel that is not widely known. Skyhook, Inc. has spent years wardriving the country (and abroad) to map out the MAC addresses of wireless routers. Given the MAC address of an access point, their database can tell you where it is located. There are browser add-ons that query Skyhook’s database and determine the user’s current location. Note that you don’t have to be browsing wirelessly — all you need is at least one WiFi access point within range. This information can then be transmitted to websites which can provide location-based functionality; Opera, in particular, has teamed up with Skyhook and is “looking forward to a future where geolocation data is as assumed part of the browsing experience.” The protocol by which the browser communicates geolocation to the website is being standardized by the W3C.

The good news from the privacy standpoint is that the accurate geolocation technologies like the Skyhook plug-in (and a competing offering that is part of Google Gears) require user consent. However, I anticipate that once the plug-ins become common, websites will entice users to enable access by (correctly) pointing out that their location can only be determined to within a few hundred meters, and users will leave themselves vulnerable to inference attacks that make use of location pairs rather than individual locations.

Image metadata. An increasing number of cameras these days have (GPS-based) geotagging built-in and enabled by default. Even more awesome is the Eye-Fi card, which automatically uploads pictures you snap to Flickr (or any of dozens of other image sharing websites you can pick from) by connecting to available WiFi access points nearby. Some versions of the card do automatic geotagging in addition.

If you regularly post pseudonymously to (say) Flickr, then the geolocations of your pictures will probably reveal prominent clusters around the places you frequent, including your home and work. This can be combined with auxiliary data to tie the pictures to your identity.

Now let us turn to the other major question: what are the sources of auxiliary data that might link location pairs to identities? The easiest approach is probably to buy data from Acxiom, or another provider of direct-marketing address lists. Knowing approximate home and work locations, all that the attacker needs to do is to obtain data corresponding to both neighborhoods and do a “join,” i.e, find the (hopefully) unique common individual. This should be easy with Axciom, which lets you filter the list by  “DMA code, census tract, state, MSA code, congressional district, census block group, county, ZIP code, ZIP range, radius, multi-location radius, carrier route, CBSA (whatever that is), area code, and phone prefix.”

Google and Facebook also know my home and work addresses, because I gave them that information. I expect that other major social networking sites also have such information on tens of millions of users. When one of these sites is the adversary — such as when you’re trying to browse anonymously — the adversary already has access to the auxiliary data. Google’s power in this context is amplified by the fact that they own DoubleClick, which lets them tie together your browsing activity on any number of different websites that are tracked by DoubleClick cookies.

Finally, while I’ve talked about image data being the target of de-anonymization, it may equally well be used as the auxiliary information that links a location pair to an identity — a non-anonymous Flickr account with sufficiently many geotagged photos probably reveals an identifiable user’s home and work locations. (Some attack techniques that I describe on this blog, such as crawling image metadata from Flickr to reveal people’s home and work locations, are computationally expensive to carry out on a large scale but not algorithmically hard; such attacks, as can be expected, will rapidly become more feasible with time.)

devicesSummary. A number of devices in our daily lives transmit our physical location to service providers whom we don’t necessarily trust, and who keep might keep this data around or transmit it to third parties we don’t know about. The average user simply doesn’t have the patience to analyze and understand the privacy implications, making anonymity a misleadingly simple way to assuage their concerns. Unfortunately, anonymity breaks down very quickly when more than one location is associated with a person, as is usually the case.

19 comments May 13, 2009

Is Anonymity Research Ethical?

A researcher who is working on writing style analysis (“stylometry”), after reading my post on related de-anonymization techniques, wonders what the positive impact of such research could be, given my statement that the malicious uses of the technology are far greater than the beneficial ones. He says:

Sometimes when I’m thinking of an interesting research topic it’s hard to forget the Patton Oswalt line “Hey, we made cancer airborne and contagious! You’re welcome! We’re science: we’re all about coulda, not shoulda.”

This was my answer:

To me, generic research on algorithms always has a positive impact (if you’re breaking a specific website or system, that’s a different story; a bioweapon is a whole different category.) I do not recognize a moral question here, and therefore it does not affect what I choose to work on.

My belief that the research will have a positive impact is not at odds with my belief that the uses of the technology are predominantly evil.  In fact, the two are positively correlated. If we’re talking about web search technology, if academics don’t invent it, then (benevolent) companies will. But if we’re talking about de-anonymization technology, if we don’t do it, then malevolent entities will invent it (if they haven’t already), and of course, keep it to themselves. It comes down to a choice between a world where everyone has access to de-anonymization techniques, and hopefully defenses against it, versus one in which only the bad guys do. I think it’s pretty clear which world most people will choose to live in.

I realize I lean toward the “coulda” side of the question of whether Science is—or should be—amoral. Someone like Prof. Benjamin Kuipers here at UT seems to be close to the other end of the spectrum: he won’t take any DARPA money.

Part of the problem with allowing morality to affect the direction of science is that it is often arbitrary. The Patton Oswalt quote above is a perfect example: he apparently said that in response to news of science enabling a 63 year old woman to give birth. The notion that something is wrong simply because it is not “natural” is one that I find most repugnant. If the freedom of a 63 year old woman to give birth is not an important issue to you, let me note that more serious issues such as stem cell research, that could save lives, fall under the same category.

Going back to anonymity, it is interesting that tools like Tor face much criticism, but for enabling the anonymity of “bad” people rather than breaking the anonymity of “good” people. Who is to be the arbiter of the line between good and bad? I share the opinion of most techies that Tor is a wonderful thing for the world to have.

There are many sides to this issue and many possible views. I’d love to hear your thoughts.

8 comments April 9, 2009

De-anonymizing Social Networks

Our social networks paper is finally officially out! It will be appearing at this year’s IEEE S&P (Oakland).

Download: PDF | PS | HTML

Please read the FAQ about the paper.

Abstract:

Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc.

We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.

Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.

The HTML version was produced using  my Project Luther software, which in my opinion produces much prettier output than anything else (especially math formulas). Another big benefit is the handling of citations: it automatically searches various bibliographic databases and adds abstract/bibtex/download links and even finds and adds links to author homepages in the bib entries.

I have never formally announced or released Luther; it needs more work before it can be generally usable, and my time is limited. Drop me a line if you’re interested in using it.

18 comments March 19, 2009

Anonymous Data Collection: Lessons from the A-Rod Affair

Recently, the Alex Rodriguez steroid controversy has been in the news. The aspect that interests me is the manner in which it came to attention: A-Rod provided a urine sample as part of a supposedly anonymous survey of Major League Baseball players in 2003, the goal of which was to determine if more than 5% of players were using banned substances. When Federal agents came calling, the sample turned out to be not so anonymous after all.

The failure of anonymity here was total–the testing lab simply failed to destroy the samples or even take the labels off them, and the Players’ Union, which conducted the survey, failed to call the lab and ask them to do so during the more than one-week window that they had before the subpoena was issued.

However, there are a number of ways in which things could have gone wrong even if one or more of the parties had followed proper procedure. None of the scenarios below result in as straightforward an association between player and steroid use as we have seen. On the other hand, they can be just as damaging in the court of public opinion.

  • If the samples were not destroyed, but simply de-identified, DNA can be recovered even after years, and the DNA can be used to match the player to the sample. You might argue the feds can’t easily get hold of players’ DNA to run such a matching, but once the association between drug test result and DNA has been made, it is a sword of Damocles hanging over the player’s head (note that A-Rod’s drug test happened six years ago.) The trend in recent years has been toward increased DNA profiling and bigger and bigger databases, and unlabeled samples therefore pose a clear danger.
  • If the samples are destroyed, and the test results are stored in de-identified form, anonymity could still be compromised. A drug test measures the concentrations of a bunch of different chemicals in the urine. It is likely that this results in a “profile” that is characteristic of a person–just like a variety of other biometric characteristics. If the same player, having stopped the use of banned substances, provides another urine sample, it is possible that this profile can be matched to the old one based on the fact that most of the urine chemicals have not changed in concentration. It is an interesting research question to see how stable the “profiles” are, and what their discriminatory power is.
  • Even more sophisticated attacks are possible. Let’s say that participant names are known, but other than that the only thing that’s released is a single statistic: the percentage of players that tested positive. Now, if the survey is performed on a regular basis, and a certain player (who happens to use steroids) participates only some of the time, the overall statistic is going to be slightly higher whenever that player participates. In spite of confounding factors, such as the fact that other players might also drop in and out, statistical techniques can be used to tease out this correlation. 

    This might sound like a tall order at first, but it is a proven attack strategy. The technique was used recently in a PLoS Genetics paper to identify if an individual had contributed DNA to an aggregate sample of hundreds of individuals. 

    I performed a quick experiment, assuming that there are 1,000 players in the sample, of which 100 participate half the time (the rest participate all the time). 5% of the players dope, and each player either dopes throughout the study period or not at all. Testing is done every 3 months; the list of participants in each wave of the survey is known, as well as the percentage of players who tested positive in each wave. I found that after 3 years, there is enough information to identify 80% of the cheating players who participate irregularly. (Players who participate regularly are clearly safe.) 

    [Technical note: that's an equal error rate of 20%; i.e, 20% of the cheating players are not accused, and 20% of the accused are innocent. There is a trade-off between the two numbers, as always; if a higher accuracy is required, say only 10% of accused players are innocent, then 65% of the cheating players can be identified.]

  • When applicable, a combination of the above techniques such as matching de-identified profiles across different time-periods of a survey (or different surveys) can greatly increase the attacker’s potential.

The point of the above scenarios is to convince you that you can never, ever be certain that the connection between a person and their data has been definitively severed. Regular readers of this blog will know that this is a recurring theme of my research. The quantity of data being collected today and the computational power available have destroyed the traditional and ingrained assumptions about anonymity. Well-established procedures have been shown to be completely inadequate, and it is far from clear that things can be fixed. Anyone who cares about their privacy must be vigilant against giving up their data under false promises of anonymity.

Add comment February 19, 2009

De-anonymizing the Internet

I’ve been thinking about this problem for quite a while: is it possible to de-anonymize text that is posted anonymously on the Internet by matching the writing style with other Web pages/posts where the authorship is known? I’ve discussed this with many privacy researchers but until recently never written anything down. When someone asked essentially the same question on Hacker News, I barfed up a stream of thought on the subject :-) Here it is, lightly edited.

Each one of us has a writing style that is idiosyncratic enough to have a unique “fingerprint”. However, it is an open question whether it can be efficiently extracted.

The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say ’since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.

The beginning of modern, rigorous research in this field was by Mosteller and Wallace in 1964: they identified the author of the disputed Federalist papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other “coups” for writing-style de-anonymization are the identification of the author of Primary Colors, as well as the unabomber (his brother recognized his style, it wasn’t done by statistical/computational means).

The current state of the art is summarized in this bibliography. Now, that list stops at 2005, but I’m assuming there haven’t been earth-shattering changes since then. I’m familiar with the results from those papers; the curious thing is that they stop at corpuses of a couple hundred authors or so — i.e, identifying one anonymous poster out of say 200, rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need is always logarithmic in the potential number of authors, and so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.

The other interesting thing is that the papers are fixated with ‘topic-free’ identification, where the texts aren’t about a particular topic, making the problem harder. The good news is that when you’re doing this Internet-scale, nobody is stopping you from using topic information, making it a lot easier.

So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you’d need fairly long texts, perhaps a page or two. It’s doubtful that anything can be done with a single average-length email.

Another potential de-anonymization strategy is to use typing pattern fingerprinting (keystroke dynamics), i.e, analyzing the timing between our keystrokes (yes, this works even for non-touch typists.) This is already used in commercial products as an additional factor in password authentication. However, the implications for de-anonymization have not been explored, and I think it’s very, very feasible. i.e, if google were to insert javascript into gmail to fingerprint you when you were logged in, they could use the same javascript to identify you on any web page where you type in text even if you don’t identify yourself. Now think about the de-anonymization possibilities you can get by combining analysis of writing style and keystroke dynamics…

By the way, make no mistake: the malicious uses of this far overwhelm the benevolent uses. Once this technology becomes available, it will be very hard to post anonymously at all. Think of the consequences for political dissent or whistleblowers. The great firewall of China could simply insert a piece of javascript into every web page, and poof, there goes the anonymity of everyone in China.

It think it’s likely that one can build a tool to protect anonymity by taking a chunk of writing and removing your fingerprint from it, but it will need a lot of work, and will probably lead to a cat-and-mouse game between improved de-anonymization and obfuscation techniques. Note the caveats, however: most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts — rants about employers, accounts from cheating spouses, political dissent, etc. — that have already been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.

If you’re interested in working on this problem–either writing style analysis for breaking anonymity or obfuscation techniques for protecting anonymity–drop me a line.

16 comments January 15, 2009

Previous Posts


Me, elsewhere

Get notified

Be notified when there's a new post — subscribe to the feed, follow me on twitter or use the email subscription box below.

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Recent comments

Tags

a-rod academia algorithm algorithms anonymity author recognition censorship conference data de-anonymization DNA DNA profiling eccentricity entropy ethics facebook forensics free speech FTC genome Google google buzz Google docs graph isomorphism history stealing Internet k-anonymity law lending club livejournal location meta netflix policy privacy privacy by design privacy policy re-identification social network analysis social networks stylometry theory ubercookies web browsers web security