Posts Tagged social networks

Ubercookies Part 2: History Stealing meets the Social Web

Recap. In the previous article I introduced ubercookies — techniques that websites can use to de-anonymize visitors. I discussed a recent paper that shows how to use history stealing along with social network group membership information to find the visitor’s identity, and I promised a stronger variant of the attack.

The observation that led me to the attack I’m going to describe is simple: social networking isn’t just about social networks — the whole web has gone social. It’s a view that you quickly internalize if you spend any time hanging out with Silicon Valley web entrepreneurs :-)

Let’s break the underlying principle of the identity-stealing attack down to its essence:

A user leaves a footprint whenever their interaction with a specific web page is recorded publicly.

De-anonymization happens when the attacker can tie these footprints together into “trails” that can then be correlated with the user’s browser history. Efficiently querying the history to identify multiple points on the trail is a challenging problem to solve, but in principle de-anonymization is possible as long as the user’s actions on different web pages happen under the same identity.

Footprints can be tied together into trails as long as all the interactions happen under the same identity. There is no need for the interactions to be on the same website.

There are two major ways in which you can interact with arbitrary websites under a unified identity, both of which are defining principles of the social web. The first is federated identity, which means you can use the same identity provider wherever you go. This is achieved through OpenID and similar mechanisms like Facebook Connect. The second is social sharing: whenever you find something interesting anywhere on the web, you feed it back to your social network.

Now let’s examine the different types of interactions in more detail.

A taxonomy of interaction on the social web.

0. The pre-social web had no social networks and no delegated identity mechanism (except for the failed attempt by Microsoft called Passport). Users created new identities on each website, authenticated via site-specific usernames and passwords to each site separately. The footprints on different sites cannot be tied together; for practical purposes there are no footprints.

1. Social networks: affiliation. In social networks, users interact with social objects and leave footprints when the actions are public. The key type of interaction that is useful for de-anonymization is the expression of affiliation: this refers to not just the group memberships studied in the recent Wondracek et al. paper, but also includes

  • memberships of fan pages on Facebook
  • “interests” on Livejournal
  • follow relationships and plain old friend relationships on Twitter and other public social networks
  • subscriptions to Youtube channels

and so on.

All of these interactions, albeit very different from the user perspective, are fundamentally the same concept:

  • you “add yourself” to or affiliate yourself with some object on a social network
  • this action can be publicly observed
  • you almost certainly visited a URL that identifies the object before adding it.
As you can imagine, these actions leave a trail.

2. The social web: sharing. When you find a page you like — any page at all — you can import it or “share” it to your social stream, on Facebook, Twitter, Google Buzz, or a social bookmarking site like Delicious. The URL of the page is almost certainly in your history, and as long as your social stream is public, your interaction was recorded publicly.


3. The social web: federated identity. When you’re reading a blog post or article on the social web, you can typically comment on it, “like” it, favorite it, rate it, etc. You do all this under your Facebook, Google or other unified identity. These actions are often public and when they are, your footprint is left on the page.

A taxonomy of attacks

The three types of social interactions above give rise to a neat taxonomy of attacks. They involve progressively easier backend processing and progressively more sophisticated history search techniques on the front end. But the execution time on the front-end doesn’t increase, so it is a net win. Here’s a table:

Type of interaction
Backend processing
Type of history URL
Location of footprint
Affiliation Crawling of social network Object in a social network In the social network
Sharing Syndication of social stream(s) from social network Any page In the social network
Federated identity None; optional crawling Any page On the page

.

1. Better use of affiliation information. The Wondracek et al. paper makes use of only group membership. One natural reason to choose groups is that there are many groups that are large, with thousands of members, so it gives us a reasonably high chance that by throwing darts in the browser history we will actually hit a few groups that the user has visited. On the other hand, if we try to use the Facebook friend list, for example, hoping to find one of the user’s friends by random chance, it probably won’t work because most users have only a few hundred friends.

But wait: many Twitter users have thousands or even millions of followers. These are known as “hubs” in network theory. Clearly, the attack will work for any kind of hubs that have predictable URLs, and users on Twitter have even more predictable URLs (twitter.com/username) than groups on various networks. The attack will also work using Youtube favorites (which show up by default on the user’s public profile or channel page) and whatever other types of affiliation we might choose to exploit, as long as there are “hubs” — nodes in the graph with high degree. Already we can see that many more websites are vulnerable than the authors envisaged.

2. Syndicating the social stream: my Delicious experiment.

The interesting thing about the social stream is that you can syndicate the stream of interactions, rather than crawling. The reasons why syndication is much easier than crawling are more practical than theoretical. First, syndicated data is intended to be machine readable, and is therefore smaller as well as easier to parse compared to scraping web pages. Second, and more importantly, you might be to get a feed of the entire site-wide activity instead of syndicating each user’s activity stream separately. Delicious allows global syndication; Twitter plans to open this “firehose” feature to all developers soon.

Another advantage of the social stream is that everything is timestamped, so you can limit yourself to recent interactions, which are more likely to be in the user’s history.

Using the delicious.com dataset made available by DAI-labor (a log of all bookmarking activity on delicious.com over several years), I did a simulated experiment using 3 months worth of data: assuming that users keep their history around for 3 months, do in fact visit every link they post on delicious, how many users would a hypothetical history stealing attack be able to identify? I had a pretty good success rate: about 60% of the users who had shared at least 2 links in the 3-month period, or about 300,000 users. This takes at most 4000-5000 Javascript history queries.

Needless to say, once Twitter opens up its firehose, Twitter users (who are far more numerous than delicious users) would also be susceptible to the same technique.

This attack is not possible to fix via server-side URL randomization. It can also be made to work using Facebook, Google Buzz, and other sharing platforms, although the backend processing required won’t be as trivial (but probably no harder than in the original attack.)

3. A somewhat random walk through the history park.

And now for an approach that potentially requires no backend data collection, although it is speculative and I can’t guess what the success rate would be. The attack proceeds in several steps:

  1. Identify the user’s interests by testing if they’ve visited various popular topic-specific sites. Pick one of the user’s favorite topics. Incidentally, a commenter on my previous post notes he is building exactly this capability using topic pages on Wikipedia, also with the goal of de-anonymization!
  2. Grab a list of the top blogs on the topic you picked from one of the blog directories. Query the history to see which of these blogs the user reads frequently. It is even possible to estimate the level of interest in a blogs by looking at the fraction of the top/recent posts from that blog that the user has visited. Pick a blog that the user seems to visit regularly.
  3. Look for evidence of the user leaving comments on posts. For example, on Blogger, the comment page for a post has the URL http://www.blogger.com/comment.g?blogID=<blogid>&postID=<postid>.
  4. Once you find a couple of posts where it looks like the user made a comment, scrape the list of people who commented on it, find the intersection. (Even a single comment might suffice; as long as you have a list of candidates, you easily verify if it’s one of them by testing user-specific URLs. More below.)
  5. Depending on the blogging platform, you might even be able to deduce that the user responded (or intended to respond) to a specific comment. For example, On wordpress you have the pattern http://<blogname>.wordpress.com/<postname>/?replytocom=<commentid>#respond. If you get lucky and find one of those patterns, that makes things even easier.

If at first you don’t succeed, pick a different blog and repeat.

I suspect that the most practical method would be to use a syndicated activity stream from a social network, but also to use the heuristics presented above to more efficiently search through the history.

Epilogue: Identity.

Not only has there been a movement towards a small number of identity providers on the web, there are many aggregators out there that have sprung up in order to automatically find the connections between identities across the different identity providers, and also connect online identities to physical-world databases. As Pete Warden notes:

One of the least-understood developments of the last few years is the growth of databases of personal information linked to email addresses. Rapleaf is probably the leader in this field, but even Flickr lets companies search their API for users based on an email address.

I ran my email address through his demo script and it is quite clear that virtually all of my online identities have been linked together. This is getting to be the norm; as a consequence, once an attacker gets any kind of handle on you, they can go “identity hopping” and find out a whole lot more about you.

This is also the reason that once the attacker can make a reasonable guess at the visitor’s identity, it’s easy to verify the guess. Not only can they look for user-specific URLs in your history to confirm the guess (described in detail in the Wondracek et al. paper), but all your social streams on other sites can also be combined with your history to corroborate your identity.

Up next in the Ubercookies series: So that’s pretty bad. But it’s going to get worse before it can get better :-) In the next article, I will describe an entirely different attack strategy to get at your identity by exploiting a bug in a specific identity provider’s platform.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

3 comments February 19, 2010

Cookies, Supercookies and Ubercookies: Stealing the Identity of Web Visitors

Synopsis. Highly sticky techniques called supercookies for tracking web visitors are becoming well known. But the state of the art has in fact advanced beyond tracking, allowing a website to learn the identity of a visitor. I will call these techniques ubercookies; this article describes one such recently discovered technique. Future articles in this series will describe even more powerful variants and discuss the implications.

Cookies. Most people are aware that their web browsing activity over time and across sites can be tracked using cookies. When you are being tracked, it can be deduced that the same person visited certain sites at certain times, but the sites doing the tracking don’t know who you are, i.e., you name, etc., unless you choose to tell them in some way, such as by logging in.

Cookies are easy to delete, and so there’s been a big impetus in the Internet advertising industry to discover and deploy more robust tracking mechanisms.

Supercookies. You may surprised to find just how helpless a user is against a site (or more usually, a network of sites) that is truly determined to track them. There are Flash cookies, much harder to delete, some of which respawn the regular HTTP cookies that you delete. The EFF’s Panopticlick project demonstrates many “browser fingerprinting” methods which are more sophisticated. (Jonathan Mayer’s senior thesis contained a smaller-scale demonstration of some of those techniques).

A major underlying reason for a lot of these problems is that any browser feature that allows a website to store “state” on the client can be abused for tracking, and there are a bewildering variety of these. There is a great analysis in a paper by my Stanford colleagues. One of the points they make is that co-operative tracking by websites is essentially impossible to defend against.

Ubercookies: history stealing. Now let’s get to the scary stuff: uncovering identity. History stealing or history sniffing is an unintended consequence of the way the web is designed; it allows a website to learn which URLs you’ve been to. While a site can’t simply ask your browser for a list of visited URLs, it can ask “yes/no” questions and your browser will faithfully respond. The most common way of doing this is by injecting invisible links into the page using Javascript and exploiting the fact that the CSS link color attribute depends on whether the link has been visited or not.

History stealing has been known for a decade, and browser vendors have failed to fix it because it cannot be fixed without sacrificing some useful functionality (the crude way is to turn off visited link coloring altogether; a subtler solution is SafeHistory). Increasingly worse consequences have been discovered over the years: for example, a malicious site can learn which bank you use and customize a phishing page accordingly. But a paper (full text, PDF) coming out at this year’s IEEE S&P conference at Oakland takes it to a new level.

Identity. Let’s pause for a second and think about what finding your identity means. In the modern, social web, social network accounts have become our de-facto online identities, and most people reveal their name and at least some other real-world information about ourselves on our profiles. So if the attacker can discover the URL of your social network profile, we can agree that he has identified you for all practical purposes. And the new paper shows how to do just that.

The attack relies on the following observations:

  1. Almost all social networking sites have some kind of “group” functionality: users can add themselves to groups.
  2. Users typically add themselves to multiple groups, at least some of which are public.
  3. Group affiliations, just like your movie-watching history and many other types of attributes, are sufficient to fingerprint a user. There’s a high chance there’s no one else who belongs to the same set of groups that you do (or is even close). [Aside: I used this fact to show that Lending Club data can be de-anonymized.]
  4. Users who belong to a group are likely to visit group-specific URLs that are predictable.

Put the above facts together, and the attack emerges: the attacker (an arbitrary website you visit, without the co-operation of whichever social network is used as an attack enabler) uses history stealing to test a bunch of group-related URLs one by one until he finds a few (public) groups that the anonymous user probably belongs to. The attacker has already crawled the social network, and therefore knows which user belongs to which groups. Now he puts two and two together: using the list of groups he got from the browser, he does a search on the backend to find the (usually unique) user who belongs to all those groups.

Needless to say, this is a somewhat simplified description. The algorithm can be easily modified so that it will work even if some of the groups have disappeared from your history (say because you clear it once in a while) or if you’ve visited groups you’re not a member of. The authors demonstrated that the attack with real users on the Xing network, and also showed theoretically that it is feasible on a number of other social networks including Facebook and Myspace. It takes a few thousand Javascript queries and runs in a few seconds on modern browsers, which makes it pretty much surreptitious.

Fallout. There are only two ways to try to fix this. The first is for all the social networking sites to change their URL patterns by randomizing them so that point 4 above (predictable URL identifying that you belong to a group) is no longer true. The second is for all the browser vendors to fix their browsers so that history stealing is no longer possible.

The authors contacted several of the social networks; Xing quickly implemented the URL randomization fix, which I find surprising and impressive. Ultimately, however, Xing’s move will probably be no more than a nice gesture, for the following reason.

Over the last few days, I have been working on a stronger version of this attack which:

  • can make use of every URL in the browser history to try and identify the user. This means that server-side fixes are not possible, because literally every site on the web would need to implement randomization.
  • avoids the costly crawling step, further lowering the bar to executing the attack.

That leaves browser-based fixes for history stealing, which hasn’t happened in the 10 years that the problem has been known. Will browsers vendors finally accept the functionality hit and deal with the problem? We can hope so, but it remains to be seen.

In the next article, I will describe the stronger attack and also explain in more detail why your profile page on almost any website is a very strong identifier.

Thanks to Adam Bossy for reviewing a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

23 comments February 18, 2010

Privacy is not Access Control (But then what is it?)

In my previous article on the Google Buzz fiasco, I pointed out that the privacy problems were exacerbated by the fact that the user interface was created by programmers. In this post I will elaborate on that theme and provide some constructive advice on privacy-conscious design, especially for social networking.

The problem I’m addressing is that as far as computer scientists and computer programmers are concerned, privacy is a question of access control, i.e., who is allowed to look at what. Unfortunately, in the real world, that is only a tiny part of what privacy is about. Here are three examples to make my point:

1. Dummy cameras. Consider a thought experiment: suppose the government installed a bunch of cameras all over a public park along with prominent signs announcing 24×7 surveillance. The catch, however, is that the cameras have not been turned on. Has anyone’s privacy been violated?

From the computer science perspective, the answer is no, because no one is actually being observed, nothing is being recorded and no data is being generated. But common sense tells us that something is wrong with that answer. The cameras cause people considerable discomfort. The surveillance, real or imaginary, changes their behavior.

This hypothetical scenario is adapted from Ryan Calo’s paper, which analyzes in detail the “sensation of being observed.”

2. Aggregation changes the equation. Remember the uproar when Facebook released News Feed? No new information was revealed to your friends that wasn’t accessible to them before; it was just that the News Feed made it dramatically easier to observe all your activities on the site.

Of course, it goes both ways: the technology in turn changed people’s expectations; it is now hard to imagine not having a feed-like system, whether on Facebook or another social network. Nevertheless, I often see people putting something into their profile, deciding a few moments later that they didn’t want to share it after all, and realizing that it was too late because the information has already been broadcast to their friends.

3. Everyone-but-X access control, which I described in an earlier article, shows in a direct way how access control fails to capture privacy requirements. From the traditional CS security perspective, the ability for a user to make something visible to “everyone but X” is meaningless: X can always create a fake account to get around it.

But a use-case should hopefully immediately convince you that everyone-but-X is a good idea: your sibling is on your friends list and you want to post about your sex life. It’s not that you want to prevent X from having access to your post, but rather that both of you prefer that X didn’t have access to it.

Access control is not the goal of privacy design. It is at best one of many tools. Rather, human behavior is key. The dummy cameras were bad because they affected the behavior of people in a detrimental way. News feed was bad because it introduced major new privacy consequences for the behaviors that people were accustomed to on the site. (However, I would argue that the dramatic increase in usefulness trumped the privacy drawbacks.) Everyone-but-X privacy is good because it allows people to carry over to the online setting behaviors that they are used to in the real world.

It is impossible to fully analyze the privacy consequences of a design decision without studying its impact on actual user behavior. There is no theoretical framework to ensure that a design decision is safe — user testing is essential. Going back to Google Buzz, a beta period or a more gradually phased roll-out would have undoubtedly been better.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

8 comments February 13, 2010

Google Buzz, Social Norms and Privacy

Another day, another privacy backlash — this time with Google Buzz. What’s new? Lots, as it turns out.

There are many minor ways in which Google Buzz fails, both with regard to privacy and otherwise. For example, I’ve been posting my Buzz updates publicly because the user interface for posting it to a restricted group is horribly clunky. (Post only to my followers? What’s the point of that, when anyone can start following me?! Make it easy to post to a group that I have control over!)

But the major privacy SNAFU, as you’ve probably heard, is auto-follow. Google automatically makes public a list of the top 25 or so people you’ve corresponded with in Gmail or Google talk. Worse, the button to turn this “feature” off resides in your Google-wide profile, making it unnecessarily hard to find because it isn’t within the Buzz interface itself.

This is a classic example of what happens when the user interface is created by programmers instead of designers, a recurring problem for Google. Programmers partition features in a way that fits the computer’s natural data model, rather than the user’s natural mental model.

But getting back to privacy, it is a certainty in a statistical sense that Google outed a few affairs and other secret relationships. For even if you were yourself savvy enough to turn off the public display of your top correspondents, there’s a good chance the other party wasn’t, and might not have turned it off on their end.

When I enabled Buzz and realized what had happened, something changed for me in my head. I’d always regarded email and chat as a private medium. But that’s not true any more; Google forced me to discard my earlier expectations. Even if Google apologizes and retracts auto-follow (not that I think that’s likely), the way I view email has permanently changed, because I can’t be sure that it won’t happen again. I lost some of the privacy expectation that I had of not only Google’s services, but of email and chat in general, albeit to a lesser extent.

What I’ve tried to do in the preceding paragraphs is show in a step-by-step manner how Google’s move changed social norms. The larger players like Google and Microsoft have been very conservative when it comes to privacy, unlike upstarts like Facebook. So why did Google enable auto-follow? By all accounts, their hand was forced: they needed a social network to compete with Facebook and Twitter. Given the head-start that their competitors have, the only real way to compete was to drag their users into participating.

Google ended up changing society’s norms in a detrimental way in order to meet their business objectives. This has become a recurring theme (c.f. the section on Facebook in that article). I don’t think there is any possibility of putting the genie back in the bottle; this trend will only continue. This time it was about who I email; soon my expectations about the contents of emails themselves will probably change.

I believe that in the long run, the only “stable equilibrium” of privacy norms, as it were, would be for everyone to simply assume that everything they type into a computer will be publicly visible either instantly or at some point in the future, outside their control. That is not necessarily as terrible as it may seem. Nonetheless, society will take a long time to get there. Until then, the best we can do is push back against intrusions as much as possible, delaying the inevitable, giving ourselves enough time to adapt.

Do your part to fight back against auto-follow. Let Google know how you feel. Blog about it or leave a comment.

Updates

  1. A New York Times blogger picked up the controversy.
  2. Joe Bonneau has an analysis of users’ confused reactions.
  3. Google has announced that it is rolling out some user-interface changes in response to the feedback. That is better than before, but the default is still public auto-follow.
  4. The horror stories due to auto-follow have begun.
  5. I have a new article with advice on privacy-conscious design.
  6. Google decided to nix auto-follow after all! Awesome.

Thanks to Joe Bonneau for reviewing a draft of this article.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

20 comments February 11, 2010

Livejournal Done Right: The Case for a Social Network with Built-in Privacy

Is it time to give up on privacy in social networking? I argue that the exact opposite is true. Impatient readers can skip to the bullet-point summary at the end.

Based on my work on de-anonymizing social networks with Shmatikov, and other research such as Bonneau & Preibusch’s survey of the dismal state of privacy in social networks, many people have concluded that it is time to give up on social networking privacy. In my opinion, this couldn’t be farther from the truth.

Being a hard-headed pragmatist (at least by the lax standards of academia (-:), I will make the case that there is a market for a social networking site designed from the ground-up with privacy in mind, as opposed to privacy being tagged on piecemeal in reaction to PR incidents.

It would seem that a good place to start would be to look at existing social networks with designed-in privacy, and see how they have fared. Unfortunately, researchers are still hammering out exactly what that would look like, and there are no real examples in the wild. In fact, part of the reason for this post is to flesh out some principles for designed-in privacy. So I will use a definition based on privacy outcomes instead:

The privacy strength of a social network is the extent to which its users share sensitive information with one another.

Viewed from this perspective, there is only one widely-used social network (at least in the U.S.) that has strong privacy, one that stands out from all the rest: LiveJournal.

While Facebook’s privacy controls are more technologically sophisticated, there is little doubt that far more revelations of a private nature are made on LiveJournal. This discrepancy is central to the point I want to make: achieving privacy is not just about technological decisions.

There is one overarching reason for LiveJournal’s privacy success: They make it (relatively) easy for users to communicate their mental access control rules to the system. In my opinion, this should the the single most important privacy goal of a social network; the technical problem of implementing those access control rules is secondary and much easier.

On Livejournal, the goal is achieved largely due to two normative user behaviors:

  • Friending is not indiscriminate (see below).
  • Users actually use friend lists for access control.

Herding users into these behaviors is far from easy, and LiveJournal stumbled there through a variety of disparate design decisions, some wise, some not so wise, some that worked against their interest in the long run, and some downright bizarre.

  • Friendship is not mutual. While in practice over 90% of friendships are reciprocated, the difference crucially captures the asymmetric nature of trust.
  • The site is insular — it plays poorly with search engines; RSS support has been way behind other blog platforms.
  • Privacy settings are highly visible, rather than being tucked away in a configuration page. Just a couple of examples:
    • there is a privacy-level dropdown menu on the post-new-entry page.
    • when you add a friend, you are prompted to add them to one or more friend lists.
  • Weak identity. The site does not require or encourage a user to use their real name. Many users choose to hide their real-life identity from everyone except their friend-list.
  • Livejournal doesn’t inform users when they are friended. From the privacy perspective, this is a feature(!) rather than a bug — it decreases the embarrassment of an unreciprocated friending by letting both users pretend that the user who was friended didn’t notice (even though most regular users use external tools to receive such notifications.). The social norms around friending are in general far more complex than on Facebook, and there is a paper that analyzes them.

As you may have gathered from the above, social norms have a huge impact on the privacy outcome of a site; this explains both why privacy is about more than technology, as well as why privacy can never be achieved as an afterthought — because norms that have evolved can hardly ever be undone. Regrettably, but unsurprisingly, the CS literature on social network privacy has been largely blind to this aspect. (Fortunately, economists, philosophers, some hard-to-categorize researchers, and needless to say, sociologists and legal scholars have been researching social network privacy.)

Returning to my main thesis, I believe that privacy has been the central selling-point of Livejournal, even though it was never marketed to users in those terms. The privacy-centric view explains why the userbase is so notoriously vocal, why the site is able to get users to pay, why they have a huge fanfic community, much of it illegal, and why Livejournal users find it impossible to migrate to other mainstream social networks, which all lack any semblance of the privacy norms that exist on Livejournal.

Livejournal is dying, at least in the U.S., which I believe is largely due to erratic design decisions. While the decay of the site has been obvious to most users (who have seen the frequency of new posts basically fall off a cliff in the last few months), I don’t have concrete data on post frequency. Fortunately, it is not essential to the point I’m making, which is that Livejournal got a few things right but also made a lot of mistakes. We now know a lot more about privacy by design in social networks than we did a decade ago, and it is possible to do much better by starting from scratch. There is now a huge unfulfilled need in the market for someone to take a crack at.

Finally, I’m going to throw in two examples of design decisions that Livejournal (or any other network) never implemented but I believe would be hugely beneficial in achieving positive privacy outcomes:

“Everyone-but-X” access control. This is an example of a whole class of access control primitives that make no sense from the traditional computer science security perspective. If an item is visible to every logged-in user except X, X can always create a fake (“sybil”) account to get around it.

However, let me give you one simple example that I hope will immediately convince you that everyone-but-X is a good idea: your sibling is on your friends list and you want to post about your sex life. It’s not so much that you want to prevent X from having access to your post, but rather that both of you prefer that X didn’t have access to it. The relationship is not adversarial. Extrapolating a little bit, most users can benefit from everyone-but-X privacy in one context or another, but amazingly, no social network has thought of it.

The problem here is that traditional CS security theory lacks even the vocabulary to express what’s going on here. Fortunately, researchers are wising up to this, and a new paper that will be presented at ESORICS later this month argues that we need a new access control model to reason about social network privacy, and presents one that is based on Facebook (I really like this paper).

Stupidly easy friend lists. Having to manually manage friend-lists puts it beyond the patience level of the average user, and offers no hope of getting users who already have several hundred uncategorized friends to start categorizing. But technology can help: I’ve written about automated friend-list clustering and classification before.

Summary. As promised, in bullet points:

  • Livejournal is the only major social network whose users regularly share highly private material.
  • Livejournal achieved this largely because they made it easy for users to communicate their mental access control rules to the system.
  • To habituate users into doing this, social norms are crucial. They matter more than technology in affecting privacy outcomes.
  • Designing privacy is therefore largely about building the right tools to get the right social norms to evolve.
  • Livejournal doesn’t seem to have a bright future. Besides, they made many mistakes and never realized their full potential.
  • Therefore, privacy-conscious users form a large and currently severely underserved segment of the social networking audience.
  • The lessons of Livejournal and recent research can help us design privacy effectively from the ground up. The time is right, and the market is ripe.

Final note. I will be presenting the gist of this essay (preceded by a survey of the academic attempts at privacy by design) at the Social Networking Security Workshop at Stanford this Friday.

Some of the ideas in this post were inspired by these essays by Matthew Skala.

21 comments September 9, 2009

De-anonymizing Social Networks

Our social networks paper is finally officially out! It will be appearing at this year’s IEEE S&P (Oakland).

Download: PDF | PS | HTML

Please read the FAQ about the paper.

Abstract:

Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc.

We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.

Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.

The HTML version was produced using  my Project Luther software, which in my opinion produces much prettier output than anything else (especially math formulas). Another big benefit is the handling of citations: it automatically searches various bibliographic databases and adds abstract/bibtex/download links and even finds and adds links to author homepages in the bib entries.

I have never formally announced or released Luther; it needs more work before it can be generally usable, and my time is limited. Drop me a line if you’re interested in using it.

18 comments March 19, 2009

Social Network Analysis: Can Quantity Compensate for Quality?


Science magazine has labeled Christakis and Fowler the “dynamic duo”

Nicholas Christakis of Harvard and James Fowler of UC San Diego have produced a series of ground-breaking papers analyzing the spread of various traits in social networks: obesity, smoking, happiness, and most recently, in collaboration with John Cacioppo, loneliness. The Christakis-Fowler collaboration has now become well-known, but from a technical perspective, what was special about their work?

It turns out that they found a way to distinguish between the three reasons why people who are related in a social network are similar to each other.

  1. Homophily is the tendency of people to seek others who are alike. For example, most of us restrict our dates to smokers or non-smokers, mirroring our own behavior.
  2. Confounding is the phenomenon of related individuals developing a trait because of a (shared) environmental circumstance. For example, people living right next to a McDonald’s might all gradually become obese.
  3. Induction is the process of one individual passing a trait or behavior on to their friends, whether by active encouragement or by setting an example.

Clearly, only induction can cause a trait to actually spread in a social network. To distinguish between the three effects and to prove causality, according to the authors, the key is longitudinal data–data from the same individuals collected over a period of years or decades. All of the works cited above are based on the Framingham Heart Study. This corpus of data is ideally suited in several ways:

  • It contains data from three generations of individuals.
  • Very few of the participants (10 out of over 5,000) dropped out:
  • “Even subjects who migrate out of the town of Framingham (to points throughout the U.S.) remain in the study and, remarkably, come back every few years to be examined and to complete survey forms.”
  • The original study sample comprised the majority of the population of Framingham, which is (presumably) a somewhat closed social network.

This illustrates the traditional way of doing things, using carefully selected high-quality data. With the growth of online social networking websites, however, a radically different approach is gaining prominence. A good example is this Slate article that analyzes the recent “25 random things” Facebook meme using well-known epidemiological models, and concludes that marketers should “introduce a wide variety of schemes into the wild and pray like hell that one of them evolves into a virulent meme.” For a more academic/rigorous example, see the paper “Characterizing Social Cascades in Flickr” (pdf), which looks at how information disseminates through social links.

Many analogies come to mind when comparing the Old School to the New School: the Cathedral vs. the Bazaar, or Britannica vs. Wikipedia. Information in social networking sites is collected through a chaotic, organic, unsupervised process. The set of participants is entirely self-selected. Against these objections stands the indisputable fact that the process produces several orders of magnitude more data at a fraction of the cost.

Despite being only a few years old, online social network analysis has already produced deep insights: the work of Jon Kleinberg springs to mind. But will it supplant the traditional approach? I think so. My hypothesis is that with sufficiently powerful analytical methods, quantity can compensate for noise in the data. Don’t take my word for it: Harvard professor Gary King considers the availability of data from online social networks to be the “most significant turning point in the history of sociology.”

The amount and variety of social network data available to researchers, marketers, etc. is rapidly increasing; there is a detailed survey in my forthcoming paper (at IEEE S&P) on de-anonymizing social networks. In spite of the rather serious privacy concerns that are identified in the paper, the balance of business incentives appears to be towards more openness, and my prediction is that social networks will continue to move in that direction. Facebook alone has an incredible wealth of as-yet untapped data on information flow–recent the feed-focused redesign instantly transformed posted items, group memberships and fan pages into meme propagation mechanisms.

The new approach to social network analysis has benefits other than the quantity of data available. Equally important is the fact that users of social networking sites are not participating in a study; we get to observe their lives directly. The data is thus closer to reality. Furthermore, there is the possibility of studying the population actively rather than passively. For instance, if the goal is to study meme propagation, why not introduce memes into the population? This gives the researcher much greater control over the timing, point of introduction, and content of the memes being studied. Of course, this raises ethical and methodological questions, but they will be worked out in due course.

A third benefit of the new approach is that social network users often express themselves using free form text; utilized properly, this could yield much deeper data than making study participants check boxes on a Likert scale in response to canned questions (such as the now famous “How does it feel to be poor and black?“). The Flickr paper cited above analyzes the tags people use to describe pictures. With more technical sophistication, it should be possible, for example, to apply automated sentiment analysis to blog posts, tweets, etc. to determine how your opinion of a movie or book is influenced by those of your friends.

True, we don’t yet have data spanning several decades, but then things happen on a far faster timescale in the online world. There will always be research questions that fundamentally depend on studying aspects of the real world that are not replicable virtually. By and large, however, I believe the new approach is about to supplant the old. There is still a ways to go in terms of developing the techniques we need for analyzing massive, noisy datasets, but we will get there in a few short years. The Christakis-Fowler papers may soon exemplify the exception, rather than the rule, for social network analysis.

1 comment February 15, 2009

The Fallacy of Anonymous Institutions

The graph below is from the paper “Chains of affection: The structure of adolescent romantic and sexual networks.” The name of the school that the data was collected from is not revealed, and is given the working name “Jefferson High.” It is part of the National Longitudinal Study of Adolescent Health, containing very detailed health information on 100,000 high school students in 140 schools. In 12 of the schools, the entire sexual network was mapped out.

Clearly, the authors felt that concealing the identity of the school is important for protecting the privacy of the participants. It’s not hard to see why: firstly, the aggregate information presented in the study could by itself be unpleasant, especially facts about the prevalence of adolescent sexual activity in a conservative rural town (see below). Second, and more importantly, knowing the identity of the school can lead to further de-anonymization of the individuals in the network.

The graph above is rich enough that a few individuals can identify themselves purely based on the local information available to them, and thus learn things about their neighbors in the graph. A group of individuals getting together will have an even easier time of it. Furthermore, the actual paper provides a richer, temporally ordered version of the graph above.

But even strangers may benefit: depending on how well the temporal information in the sexual graph correlates with other temporal information that may be available, say from Facebook, de-anonymization might be possible with little or no co-operation from the subjects themselves. Soon, I will have more to say about research results on de-anonymizing graphs with loosely correlated external/auxiliary data.

Having established the privacy risk, let’s see how easy it is to re-identify Jefferson High. The authors give us these helpful clues:

“Jefferson High School” is an almost all-white high school of roughly 1000 students located in a mid-sized mid-western town. Jefferson High is the only public high school in the town. The town, “Jefferson City” is over an hour away by car from the nearest large city. Jefferson City is surrounded by beautiful countryside, home to many agricultural enterprises. The town itself is working class, although there remain some vestiges of better times. At one period, the town served as a resort for city dwellers, drawing an annual influx of summer visitors. This is no longer the case, and many of the old resort properties show signs of decay. The community is densely settled. At the time of our fieldwork, students were reacting to the deaths of two girls killed in an automobile accident.

Some further facts presented have high amusement value, and are equally useful for re-identification:

Jefferson students earn lower grades, are suspended more, feel less attached to school, and come from poorer families than those at comparable schools. They are more likely than students in other high schools to have trouble paying attention, have lower self-esteem, pray more, have fewer expectations about college, and are more likely to have a permanent tattoo.  Compared to other students in large disproportionately white schools, adolescents in Jefferson High are more likely to drink until they are drunk. In schools of comparable race and size, on average 30% of 10th-12th grade students smoke cigarettes regularly, whereas in Jefferson, 36% of all 10th to 12th graders smoke. Drug use is moderate, comparable to national norms.  Somewhat more than half of all students report having had sex, a rate comparable to the national average, and only slightly higher than observed for schools similar with respect to race and size.  Nevertheless, if Jefferson is not Middletown, it looks like an awful lot like it. The adolescents at Jefferson High are pretty normal. In describing the events of the past year, many students report that there is absolutely nothing to do in Jefferson. For fun, students like to drive to the outskirts of town and get drunk. Jefferson is a close-knit insular predominantly working-class community which offers few activities for its youth.

A database of public schools in the U.S. is available for sale for $75, containing very detailed information about each school. I’m quite confident that the information in there is sufficient to re-identify Jefferson High.

This thesis of this blog that the amount of entropy required to de-anonymize an individual — 33 bits — is low enough that it doesn’t offer meaningful protection in most circumstances. Obviously, the argument applies even more strongly to the anonymity of a well-defined group of people.

Let’s be clear: the paper is from 1994; who slept with whom in high school is not a huge deal a decade and a half later. However, the problem is systemic, and IRBs (Institutional Review Boards) keep blithely approving releases of data with such nominal de-identification applied. The re-identification of the institutional affiliation of an entire population of a study is of more concern from the privacy perspective than the de-anonymization of individual identities: it needs to be done only once, and affects hundreds or thousands of individuals.

Recently, a group of researchers from the Berkman Center released a dataset of Facebook profile information from an entire cohort (the class of 2009) of college students from “an anonymous, northeastern American university.” It was promptly de-anonymized by Michael Zimmer, who revealed that it was Harvard College:

As I noted here, the press release and the public codebook for the dataset provided many clues to where the data came from: we know it is a northeastern US university, it is private, co-ed, and whose class of 2009 initially had 1640 students in it. A quick search for schools reveals there are only 7 private, co-ed colleges in New England states (CT, ME, MA, NH, R , VT) with total undergraduate populations between 5000 and 7500 students (a likely range if there were 1640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College.

[...]

Finally, and perhaps most convincingly, only Harvard College offers the specific variety of the subjects’ majors that are listed in the codebook. While nearly all univerersities offer the common majors of “History”, “Chemistry” or “Economics”, one only needs to search for the more uniquely phrased majors to discover a shared home institution.

Another amusing example is a paper on mobile phone call graphs which attempts to keep the identity of an entire country secret. I found that the approximate population of the country reported in the paper together with the mobile phone penetration rate is sufficient to uniquely identify it.

Suppressing the identity of your study population has some privacy benefits: at least, it won’t show up in google searches. But relying on it for any kind of serious privacy protection would be foolish. Scrubbing an entire dataset or research paper of clues about the study population can be hard or impossible; further, a single study participant corroborating the published results or methodology might be sufficient for de-anonymization of the group. The only solution is therefore to assume that the identity of the study population will be discovered, and to try to ensure that individual identities will still be safe from re-identification.

3 comments December 15, 2008

Graph Isomorphism: Deceptively Hard

This is the first in a series of loosely related posts that will lead up to an announcement of new research results.

Asymptotic complexity is often very useful in succintly expressing how hard a problem is. Occasionally, though, it can obfuscate the picture. Graph isomorphism is a perfect example.

Many people mistakenly believe that graph isomorphism (GI) is hard — either NP-hard, or hard enough to be insoluble in practice for large problem sizes. Certainly the complexity of the best known algorithm — exp(O(sqrt(n log n))), due to Babai and Luks — would appear to support this belief. However, the truth is that GI belongs to its own complexity class, not known to be NP-hard nor known to be soluble in polynomial time. More importantly, on real inputs, graph isomorphism is ridiculously easy. So easy that real-life solvers can plow through hundreds of thousands of instances per second.

Why is this the case? It turns out that on random graphs, GI is solvable in a very straightforward way. You only need to look at the local structure of each node — due to the randomness, every node has a distinctive neighborhood. Thus, the complexity of the worst case input and the average case input are on opposite ends of a spectrum. To see where “real” graphs fall on this spectrum, note that generative models of real graphs have a regular part and a random part. The regular part is captured in rules such as preferential attachment, but such rules only induce a probability distribution; there is still quite a bit of coin tossing needed to generate each edge. Intuitively, if there is “enough” randomness in the neighborhood of each node, then you only need to look at the local structure to tell different nodes apart. Thus, real-world graphs behave pretty much like random graphs.

It gets better: one of the uses of nauty, the leading graph isomorphism solver, is to find hard instances of graph isomorphism. The way this is done is by generating hundreds of candidate hard instances, solving them, and picking the ones that are the hardest to solve. (These are usually highly specialized graphs which are strongly regular and have fearful symmetry groups.) It would appear, then, that finding hard inputs for GI is GI-hard! I wonder if this property can be formally defined, and if there are other known problems for which this is true. This is a similar notion to Impagliazzo’s Heuristica, a world where NP != P, but for any problem in NP, and for inputs drawn from any samplable distribution (i.e, for any “real-world” input), there exists a polynomial time algorithm to decide it.

An example of the kind of input that gives graph isomorphism solvers any trouble.

.

While I find this an interesting theory problem, and would love to hear opinions on it from theorists, my reason for posting this is to point out that with all the current evidence, graph isomorphism can be assumed to be solvable in randomized polynomial time for any input that you would actually encounter in reality.

Add comment November 20, 2008


Me, elsewhere

Get notified

Be notified when there's a new post — subscribe to the feed, follow me on twitter or use the email subscription box below.

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Recent comments

Tags

a-rod academia aggregation algorithm algorithms anonymity author recognition censorship conference data de-anonymization DNA DNA profiling eccentricity entropy ethics facebook forensics free speech FTC genome Google google buzz Google docs graph isomorphism history stealing Internet k-anonymity law lending club livejournal location meta netflix privacy privacy by design privacy policy re-identification social network analysis social networks stylometry theory ubercookies web browsers web security