Posts tagged ‘anonymity’
Some people claim that re-identification attacks don’t matter, the reasoning being: “I’m not important enough for anyone to want to invest time on learning private facts about me.” At first sight that seems like a reasonable argument, at least in the context of the re-identification algorithms I have worked on, which require considerable human and machine effort to implement.
The argument is nonetheless fallacious, because re-identification typically doesn’t happen at the level of the individual. Rather, the investment of effort yields results over the entire database of millions of people (hence the emphasis on “large-scale” or “en masse”.) On the other hand, the harm that occurs from re-identification affects individuals. This asymmetry exists because the party interested in re-identifying you and the party carrying out the re-identification are not the same.
In today’s world, the entities most interested in acquiring and de-anonymizing large databases might be data aggregation companies like ChoicePoint that sell intelligence on individuals, whereas the party interested in using the re-identified information about you would be their clients/customers: law enforcement, an employer, an insurance company, or even a former friend out to slander you.
Data passes through multiple companies or entities before reaching its destination, making it hard to prove or even detect that it originated from a de-anonymized database. There are lots of companies known to sell “anonymized” customer data: for example Practice Fusion “subsidizes its free EMRs by selling de-identified data to insurance groups, clinical researchers and pharmaceutical companies.” On the other hand, companies carrying out data aggregation/de-anonymization are a lot more secretive about it.
Another piece of the puzzle is what happens when a company goes bankrupt. deCODE Genetics recently did, which is particularly interesting because they are sitting on a ton of genetic data. There are privacy assurances in place in their original Terms of Service with their customers, but will those bind the new owner of the assets? These are legal gray areas, and are frequently exploited by companies looking to acquire data.
At the recent FTC privacy roundtable, Scott Taylor of Hewlett Packard said his company regularly had the problem of not being able to determine where data is being shared downstream after the first point of contact. I’m sure the same is true of other companies as well. (How then could we possibly expect third-party oversight of this process?) Since data fuels the modern Web economy, I suspect that the process of moving data around will continue to become more common as well as more complex, with more steps in the chain. We could use a good name for it — “data laundering,” perhaps?
It is futile to try to stay anonymous by getting your name or data purged from the Internet, once it is already out there. Attempts at such censorship have backfired repeatedly and spectacularly, giving rise to the term Streisand effect. A recent lawsuit provides the latest demonstration: two convicted German killers (who have completed their prison sentences) are attempting to prevent Wikipedia from identifying them.
The law in Germany tries to “protect the name and likenesses of private persons from unwanted publicity.” Of course, the Wikimedia Foundation is based in the United States, and this attempt runs head-on into the First Amendment, the right to Free Speech. European countries have a variety of restrictions on speech—Holocaust denial is illegal, for instance. But there is little doubt about how U.S. courts will see the issue; Jennifer Granick of the EFF has a nice write-up.
The aspect that interests me is that even if there weren’t a Free Speech issue, it would be utterly impossible for the court system to keep the names of these men off the Internet. I wonder if the German judge who awarded a judgment against the Wikimedia Foundation was aware that it would achieve exactly the “unwanted publicity” that the law was intended to avoid. He would probably have ruled as he did in any case, but it is interesting to speculate.
Legislators, on the other hand, would do well to be aware of the limitations of censorship, and the need to update laws to reflect the rules of the information age. There are always alternatives, although they usually involve trade-offs. In this instance, perhaps one option is a state-supplied alternate identity, analogous to the Witness Protection Program?
Returning to the issue of enforceability, the European doctrine apparently falls under “rights of the personality,” specifically the “right to be forgotten,” according to this paper that discusses the trans-Atlantic clash. I find the very name rather absurd; it reminds me of attempting not to think of an elephant (try it!).
The above paper, written from the European perspective, laments the irreconcilable differences between the two viewpoints on the issue of Free Speech vs. Privacy. However, there is no discussion of enforceability. The author does suspect, in the final paragraph, that the European doctrine will become rather meaningless due to the Internet, but he believes this to be purely a consequence of the fact that the U.S. courts have put Free Speech first.
I don’t buy it—even if the U.S. courts joined Europe in recognizing a “right to be forgotten,” it would still be essentially unenforceable. Copyright-based rather than privacy-based censorship attempts offer us a lesson here. Copyright law has international scope, due to being standardized by the WIPO, and yet the attempt to take down the AACS encryption key was pitifully unsuccessful.
Taking down a repeat offender (such as a torrent tracker) or a large file (the Windows 2000 source code leak) might be easier. But if we’re talking about a small piece of data, the only factor that seems to matter is the level of public interest in the sensitive information. The only setting in which censorship of individual facts has been (somewhat) successful in the face of public sentiment is within oppressive regimes with centralized Internet filters.
There are many laws, particularly privacy laws, that need to be revamped for the digital age. What might appear obvious to technologists might be much less apparent to law scholars, lawmakers and the courts. I’ve said it before on this blog, but it bears repeating: there is an acute need for greater interdisciplinary collaboration between technology and the law.
The State of Oklahoma just passed legislation requiring that detailed information about every abortion performed in the state be submitted to the State Department of Health. Reports based on this data are to be made publicly available. The controversy around the law gained steam rapidly after bloggers revealed that even though names and addresses of mothers obtaining abortions were not collected, the women could nevertheless be re-identified from the published data based on a variety of other required attributes such as the date of abortion, age and race, county, etc.
This was brought to my attention because I am a computer scientist studying re-identification. I was as indignant on hearing about it as the next smug Californian, and I promptly wrote up a blog post analyzing the serious risk of re-identification based on the answers to the 37 questions that each mother must anonymously report. Just before posting it, however, I decided to give the text of the law a more careful reading, and realized that the bloggers had been misinterpreting the law all along.
While it is true that the law requires submitting a detailed form to the Department of Health, the only information made public is a set of annual reports with statistical tallies of the number of abortions performed, under very broad categories, which present a negligible to nonexistent re-identification risk.
I’m not defending the law; that is outside my sphere of competence. There do appear to be other serious problems with it, outlined in a lawsuit aimed at stopping the law from going into effect. The text of this complaint, as Paul Ohm notes, does not raise the “public posting” claim. Besides, the wording of the law is very ambiguous, and I can certainly see why it might have been misinterpreted.
But I do want to lament the fact that bloggers and special interest groups can start a controversy based on a careless (or less often, deliberate) misunderstanding, and have it amplified by an emerging category of news outlets like the Huffington Post, which have the credibility of blogs but a readership approaching traditional media. At this point the outrage becomes self-sustaining, and the factual inaccuracies become impossible to combat. I’m reminded of the affair of the gay sheep.
Philippe Golle and Kurt Partridge of PARC have a cute paper (pdf) on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations — to a block level — identifies them uniquely.
Even if we look at the much coarser granularity of a census tract — tracts correspond roughly to ZIP codes; there are on average 1,500 people per census tract — for the average person, there are only around 20 other people who share the same home and work location. There’s more: 5% of people are uniquely identified by their home and work locations even if it is known only at the census tract level. One reason for this is that people who live and work in very different areas (say, different counties) are much more easily identifiable, as one might expect.
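The arithmetic behind these anonymity set sizes is simple to reproduce. Here is a minimal sketch (with made-up tract identifiers, not actual Census data) of how one might count how many people share each home/work pair:

```python
from collections import Counter

def anonymity_set_sizes(records):
    """For each person, how many people (including themselves) share
    the same (home_tract, work_tract) pair?

    `records` maps a person ID to a (home_tract, work_tract) tuple.
    """
    pair_counts = Counter(records.values())
    return {person: pair_counts[pair] for person, pair in records.items()}

# Toy data: C is uniquely identified by their location pair.
records = {
    "A": ("tract-17", "tract-42"),
    "B": ("tract-17", "tract-42"),
    "C": ("tract-17", "tract-99"),
}
sizes = anonymity_set_sizes(records)
unique = [p for p, n in sizes.items() if n == 1]  # -> ["C"]
```

Run over real commute data, the distribution of these set sizes is exactly what the paper reports: a median around 20 at tract granularity, and a long tail of singletons.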
The paper is timely, because Location Based Services are proliferating rapidly. To understand the privacy threats, we need to ask the two usual questions:
- who has access to anonymized location data?
- how can they get access to auxiliary data linking people to location pairs, which they can then use to carry out re-identification?
The authors don’t say much about these questions, but that’s probably because there are too many possibilities to list! In this post I will examine a few.
GPS navigation. This is the most obvious application that comes to mind, and probably the most privacy-sensitive: there have been many controversies around tracking of vehicle movements, such as NYC cab drivers threatening to strike. The privacy goal is to keep the location trail of the user/vehicle unknown even to the service provider — unlike in the context of social networks, people often don’t even trust the service provider. There are several papers on anonymizing GPS-related queries, but there doesn’t seem to be much you can do to hide the origin and destination except via charmingly unrealistic cryptographic protocols.
The accuracy of GPS is a few tens or few hundreds of feet, which is the same order of magnitude as a city block. So your daily commute is pretty much unique. If you took a (GPS-enabled) cab home from work at a certain time, there’s a good chance the trip can be tied to you. If you made a detour to stop somewhere, the location of your stop can probably be determined. This is true even if there is no record tying you to a specific vehicle.
Location based social networking. Pretty soon, every smartphone will be capable of running applications that transmit location data to web services. Google Latitude and Loopt are two of the major players in this space, providing some very nifty social networking functionality on top of location awareness. It is quite tempting for service providers to outsource research/data-mining by sharing de-identified data. I don’t know if anything of the sort is being done yet, but I think it is clear that de-identification would offer very little privacy protection in this context. If a pair of locations is uniquely identifying, a trail is emphatically so.
The same threat also applies to data being subpoenaed, so data retention policies need to take into consideration the uselessness of anonymizing location data.
I don’t know if cellular carriers themselves collect a location trail from phones as a matter of course. Any idea?
Plain old web browsing. Every website worth the name identifies you with a cookie, whether you log in or not. So if you browse the web from a laptop or mobile phone from both home and work, your home and work IP addresses can be tied together based on the cookie. There are a number of free or paid databases for turning IP addresses into geographical locations. These are generally accurate up to the city level, but beyond that the accuracy is shaky.
A more accurate location fix can be obtained by IDing WiFi access points. This is a curious technological marvel that is not widely known. Skyhook, Inc. has spent years wardriving the country (and abroad) to map out the MAC addresses of wireless routers. Given the MAC address of an access point, their database can tell you where it is located. There are browser add-ons that query Skyhook’s database and determine the user’s current location. Note that you don’t have to be browsing wirelessly — all you need is at least one WiFi access point within range. This information can then be transmitted to websites which can provide location-based functionality; Opera, in particular, has teamed up with Skyhook and is “looking forward to a future where geolocation data is an assumed part of the browsing experience.” The protocol by which the browser communicates geolocation to the website is being standardized by the W3C.
The good news from the privacy standpoint is that accurate geolocation technologies like the Skyhook plug-in (and a competing offering that is part of Google Gears) require user consent. However, I anticipate that once the plug-ins become common, websites will entice users to enable access by (correctly) pointing out that their location can only be determined to within a few hundred meters, and users will leave themselves vulnerable to inference attacks that make use of location pairs rather than individual locations.
Image metadata. An increasing number of cameras these days have (GPS-based) geotagging built-in and enabled by default. Even more awesome is the Eye-Fi card, which automatically uploads pictures you snap to Flickr (or any of dozens of other image sharing websites you can pick from) by connecting to available WiFi access points nearby. Some versions of the card do automatic geotagging in addition.
If you regularly post pseudonymously to (say) Flickr, then the geolocations of your pictures will probably reveal prominent clusters around the places you frequent, including your home and work. This can be combined with auxiliary data to tie the pictures to your identity.
Now let us turn to the other major question: what are the sources of auxiliary data that might link location pairs to identities? The easiest approach is probably to buy data from Acxiom, or another provider of direct-marketing address lists. Knowing approximate home and work locations, all that the attacker needs to do is to obtain data corresponding to both neighborhoods and do a “join,” i.e., find the (hopefully) unique common individual. This should be easy with Acxiom, which lets you filter the list by “DMA code, census tract, state, MSA code, congressional district, census block group, county, ZIP code, ZIP range, radius, multi-location radius, carrier route, CBSA (whatever that is), area code, and phone prefix.”
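The “join” itself is trivial once the two filtered lists are in hand. A minimal sketch, with hypothetical names standing in for purchased address lists:

```python
def reidentify(home_list, work_list):
    """Intersect a home-neighborhood name list with a work-neighborhood
    name list. If exactly one person appears in both, the home/work
    location pair has identified them."""
    candidates = set(home_list) & set(work_list)
    return candidates.pop() if len(candidates) == 1 else None

# Hypothetical direct-marketing lists for the two areas.
home = {"J. Smith", "A. Jones", "P. Patel"}
work = {"A. Jones", "K. Lee"}
match = reidentify(home, work)  # -> "A. Jones"
```

If the intersection contains more than one person, the attacker simply needs one more attribute (age bracket, say) to finish the job.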
Google and Facebook also know my home and work addresses, because I gave them that information. I expect that other major social networking sites also have such information on tens of millions of users. When one of these sites is the adversary — such as when you’re trying to browse anonymously — the adversary already has access to the auxiliary data. Google’s power in this context is amplified by the fact that they own DoubleClick, which lets them tie together your browsing activity on any number of different websites that are tracked by DoubleClick cookies.
Finally, while I’ve talked about image data being the target of de-anonymization, it may equally well be used as the auxiliary information that links a location pair to an identity — a non-anonymous Flickr account with sufficiently many geotagged photos probably reveals an identifiable user’s home and work locations. (Some attack techniques that I describe on this blog, such as crawling image metadata from Flickr to reveal people’s home and work locations, are computationally expensive to carry out on a large scale but not algorithmically hard; such attacks, as can be expected, will rapidly become more feasible with time.)
Summary. A number of devices in our daily lives transmit our physical location to service providers whom we don’t necessarily trust, and who might keep this data around or transmit it to third parties we don’t know about. The average user simply doesn’t have the patience to analyze and understand the privacy implications, making anonymity a misleadingly simple way to assuage their concerns. Unfortunately, anonymity breaks down very quickly when more than one location is associated with a person, as is usually the case.
A researcher who is working on writing style analysis (“stylometry”), after reading my post on related de-anonymization techniques, wonders what the positive impact of such research could be, given my statement that the malicious uses of the technology are far greater than the beneficial ones. He says:
Sometimes when I’m thinking of an interesting research topic it’s hard to forget the Patton Oswalt line “Hey, we made cancer airborne and contagious! You’re welcome! We’re science: we’re all about coulda, not shoulda.”
This was my answer:
To me, generic research on algorithms always has a positive impact (if you’re breaking a specific website or system, that’s a different story; a bioweapon is a whole different category.) I do not recognize a moral question here, and therefore it does not affect what I choose to work on.
My belief that the research will have a positive impact is not at odds with my belief that the uses of the technology are predominantly evil. In fact, the two are positively correlated. If we’re talking about web search technology, if academics don’t invent it, then (benevolent) companies will. But if we’re talking about de-anonymization technology, if we don’t do it, then malevolent entities will invent it (if they haven’t already), and of course, keep it to themselves. It comes down to a choice between a world where everyone has access to de-anonymization techniques, and hopefully defenses against it, versus one in which only the bad guys do. I think it’s pretty clear which world most people will choose to live in.
I realize I lean toward the “coulda” side of the question of whether Science is—or should be—amoral. Someone like Prof. Benjamin Kuipers here at UT seems to be close to the other end of the spectrum: he won’t take any DARPA money.
Part of the problem with allowing morality to affect the direction of science is that it is often arbitrary. The Patton Oswalt quote above is a perfect example: he apparently said that in response to news of science enabling a 63 year old woman to give birth. The notion that something is wrong simply because it is not “natural” is one that I find most repugnant. If the freedom of a 63 year old woman to give birth is not an important issue to you, let me note that more serious issues such as stem cell research, that could save lives, fall under the same category.
Going back to anonymity, it is interesting that tools like Tor face much criticism, but for enabling the anonymity of “bad” people rather than breaking the anonymity of “good” people. Who is to be the arbiter of the line between good and bad? I share the opinion of most techies that Tor is a wonderful thing for the world to have.
There are many sides to this issue and many possible views. I’d love to hear your thoughts.
Recently, the Alex Rodriguez steroid controversy has been in the news. The aspect that interests me is the manner in which it came to attention: A-Rod provided a urine sample as part of a supposedly anonymous survey of Major League Baseball players in 2003, the goal of which was to determine if more than 5% of players were using banned substances. When Federal agents came calling, the sample turned out to be not so anonymous after all.
The failure of anonymity here was total–the testing lab simply failed to destroy the samples or even take the labels off them, and the Players’ Union, which conducted the survey, failed to call the lab and ask them to do so during the window of more than a week that they had before the subpoena was issued.
However, there are a number of ways in which things could have gone wrong even if one or more of the parties had followed proper procedure. None of the scenarios below result in as straightforward an association between player and steroid use as we have seen. On the other hand, they can be just as damaging in the court of public opinion.
- If the samples were not destroyed, but simply de-identified, DNA can be recovered even after years, and the DNA can be used to match the player to the sample. You might argue the feds can’t easily get hold of players’ DNA to run such a matching, but once the association between drug test result and DNA has been made, it is a sword of Damocles hanging over the player’s head (note that A-Rod’s drug test happened six years ago.) The trend in recent years has been toward increased DNA profiling and bigger and bigger databases, and unlabeled samples therefore pose a clear danger.
- If the samples are destroyed, and the test results are stored in de-identified form, anonymity could still be compromised. A drug test measures the concentrations of a bunch of different chemicals in the urine. It is likely that this results in a “profile” that is characteristic of a person–just like a variety of other biometric characteristics. If the same player, having stopped the use of banned substances, provides another urine sample, it is possible that this profile can be matched to the old one based on the fact that most of the urine chemicals have not changed in concentration. It is an interesting research question to see how stable the “profiles” are, and what their discriminatory power is.
- Even more sophisticated attacks are possible. Let’s say that participant names are known, but other than that the only thing that’s released is a single statistic: the percentage of players that tested positive. Now, if the survey is performed on a regular basis, and a certain player (who happens to use steroids) participates only some of the time, the overall statistic is going to be slightly higher whenever that player participates. In spite of confounding factors, such as the fact that other players might also drop in and out, statistical techniques can be used to tease out this correlation.
This might sound like a tall order at first, but it is a proven attack strategy. The technique was used recently in a PLoS Genetics paper to identify if an individual had contributed DNA to an aggregate sample of hundreds of individuals.
I performed a quick experiment, assuming that there are 1,000 players in the sample, of which 100 participate half the time (the rest participate all the time). 5% of the players dope, and each player either dopes throughout the study period or not at all. Testing is done every 3 months; the list of participants in each wave of the survey is known, as well as the percentage of players who tested positive in each wave. I found that after 3 years, there is enough information to identify 80% of the cheating players who participate irregularly. (Players who participate regularly are clearly safe.)
[Technical note: that's an equal error rate of 20%; i.e., 20% of the cheating players are not accused, and 20% of the accused are innocent. There is a trade-off between the two numbers, as always; if higher accuracy is required (say, only 10% of accused players are innocent), then 65% of the cheating players can be identified.]
- When applicable, a combination of the above techniques such as matching de-identified profiles across different time-periods of a survey (or different surveys) can greatly increase the attacker’s potential.
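To give a flavor of the aggregate-statistic scenario, here is a toy re-implementation of the experiment described above. The parameters match the ones I stated, but the scoring rule here is a deliberately simplified difference of means rather than proper statistical inference, so treat it as an illustration rather than the actual analysis:

```python
import random

def simulate_attack(n_players=1000, n_irregular=100, dope_rate=0.05,
                    n_waves=12, seed=0):
    """Toy version of the aggregate-statistic attack.

    Players 0..n_irregular-1 participate in each wave with probability
    0.5; everyone else always participates. Per wave, the attacker sees
    only the participant list and the fraction testing positive. Each
    irregular player is scored by the difference between the average
    positive rate in waves they attended and waves they missed.
    """
    rng = random.Random(seed)
    dopers = set(rng.sample(range(n_players), int(n_players * dope_rate)))
    irregular = set(range(n_irregular))
    waves = []
    for _ in range(n_waves):
        present = {p for p in range(n_players)
                   if p not in irregular or rng.random() < 0.5}
        rate = len(dopers & present) / len(present)
        waves.append((present, rate))
    scores = {}
    for p in irregular:
        with_p = [r for pres, r in waves if p in pres]
        without_p = [r for pres, r in waves if p not in pres]
        if with_p and without_p:
            scores[p] = (sum(with_p) / len(with_p)
                         - sum(without_p) / len(without_p))
    return scores, dopers & irregular

scores, cheaters = simulate_attack()
# Accuse irregular players whose presence correlates with higher rates.
accused = {p for p, s in scores.items() if s > 0}
```

With more waves the signal separates further from the noise, which is why the identification rate climbs over the three simulated years.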
The point of the above scenarios is to convince you that you can never, ever be certain that the connection between a person and their data has been definitively severed. Regular readers of this blog will know that this is a recurring theme of my research. The quantity of data being collected today and the computational power available have destroyed the traditional and ingrained assumptions about anonymity. Well-established procedures have been shown to be completely inadequate, and it is far from clear that things can be fixed. Anyone who cares about their privacy must be vigilant against giving up their data under false promises of anonymity.
The AOL and Netflix privacy incidents have shown that people responsible for data release at these companies do not put themselves in the potential attacker’s shoes in order to reason about privacy. The only rule that is ever applied is “remove personally identifiable information,” which has been repeatedly shown not to work. This fallacy deserves a post of its own, and so I will leave it at that for now.
The reality is that there is no way to guarantee privacy of published customer data without going through complex, data-driven reasoning. So let me give you an attacker’s-eye-view account of a de-anonymization I carried out last week—perhaps an understanding of the adversarial process will help reason about the privacy risks of data release.
Lending Club, a company specializing in peer-to-peer loans, makes the financial information collected from their customers (borrowers) publicly available. I learned of this a week ago, and there are around 4,600 users in the dataset as of now. This could be a textbook example illustrating a variety of types of sensitive information and a variety of attack methods to identify the individuals! Each record contains the following information:
I. Screen name
II. Loan Title, Loan Description
III. Location, Hometown, Home Ownership, Current Employer, Previous Employers, Education, Associations
IV. Amount Requested, Interest Rate, APR, Loan Length, Amount Funded, Number of Lenders, Expiration Date, Status, Application Date
V. Credit Rating, Tenure, Monthly Income, Debt-To-Income Ratio, FICO Range, Earliest Credit Line, Open Credit Lines, Total Credit Lines, Revolving Credit Balance, Revolving Line Utilization, Inquiries in the Last 6 Months, Accounts Now Delinquent, Delinquent Amount, Delinquencies (Last 2 yrs), Months Since Last Delinquency, Public Records On File, Months Since Last Record
What data is sensitive?
Of course, any of the above fields might be considered sensitive by one or another user, but there are two types of data that are of particular concern: financial data and the loan description. The financial data includes monthly income, credit rating and FICO credit score; enough said. Loan description is an interesting column. A few users just put in “student loans” or “consolidate credit card debt.” However, a more informative description is the norm, such as this one:
This loan will be used to pay off my 19% Business Credit Card with AMEX. I have supporting documentation to prove my personal Income. I would much rater get a loan and pay back fixed amount each month rather then being charged more and more each month on the same balance. I can afford to pay at min $800 a month. I have 4 Reserves in the bank and have over 70% of my credit limit open for use.
Often, users reveal a lot about their personal life in the hope of appealing to the emotions of the prospective lender. Here’s an example (this is fairly common in the data):
My husband’s lawyer has told us that we need $5000 up front to pay for his child custody case. We are going to file for primary custody. Right now he has no visitation rights according to their divorce agreement. His ex-wife has been evicted twice in the four months and is living with 2 of their 3 daughters in a two bedroom apartment with her boyfriend. She has no job or car and the only money they have is what we give them in child support and she blows all of it on junk. We have a 2000+ square foot house, both have stable jobs, and our own cars. Both girls(12 and 15 years old) are allowed to go and do whatever they please even though they are failing classes at school. We are clearly the better situation for them to be raised in but we simply do not have that much money all at once. We would be able to pay around $200 per month for repayment.
A few loan descriptions are quite hilarious. This one is my personal favorite.
Who’s the “bad guy” and what might they do with data of this kind, assuming it can be re-identified with the individuals in question? Certainly, it would help shady characters carry out identity theft. But there is also the unpleasant possibility that a customer’s family members or a boss might learn something about them that the customer didn’t intend them to know. The techniques below focus on the former threat model, en masse de-anonymization. The latter is even easier to carry out since human intelligence can be applied.
How to de-anonymize
The “screen name” field
Releasing the screen name seems totally unnecessary. Many people use a unique username everywhere (in fact, this tendency is so strong that there is a website to automate the process of testing your favorite username across websites). Often, googling a username brings up a profile page on other websites. Furthermore, these results can be further pruned in an automated way by looking at the profile information on the page. Here is an example (mjchrissy) taken from the Lending Club dataset. By observing that the person in the MySpace profile is in the same geographical location (NJ) as the person in the dataset, we can be reasonably sure that it’s the same person.
To measure the overall vulnerability, I wrote a script to find the estimated Google results count for each username in the dataset, using Google’s search API. If there are fewer than 100 results, I consider the person to be highly vulnerable to this attack; if there are between 100 and 1,000, they are moderately vulnerable. The Google count is only an approximate measure. For example, the estimated count for my standard username (randomwalker) is in the tens of thousands, but most of the results in the first few pages relate to me, and again, this can be confirmed by parsing the profile pages that are found by the search. Also, the query can be made more specific by using auxiliary terms such as “user” and “profile.” For example, the username radiothermal, also from the dataset, appears to be a normal word with tens of thousands of hits, but with the word “profile” thrown in, we get their identity right away.
Some users choose their email address as their username. This can be considered as immediately compromising their identity even if there are no google search results for it. Finally, there are users who use their real name as their screen name. This is harder to measure, but we can get a lower bound with a clever enough script. (You can find my script here; I’m quite proud of it :-)) The table below summarizes the different types and level of risk. Note that some of the categories are overlapping; the total number of high-risk records is 1725 and the total number of medium-risk records is 939.
| Category | Risk level | No. of users |
|---|---|---|
| result count = 0 | low | 1198 |
| 0 < result count < 100 | high | 1610 |
| 100 <= result count < 1000 | medium | 560 |
| 1000 <= result count | low | 1196 |
| username is email | high | 51 |
| either first or last name | medium | 429 |
| both first and last name | high | 204 |
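For concreteness, the bucketing logic behind the table can be sketched as follows. (The result count itself would come from a search API; the usernames and counts below are hypothetical, and the email check is applied first since an email address is identifying regardless of search hits.)

```python
def risk_level(username, result_count):
    """Bucket a screen name by re-identification risk, mirroring the
    thresholds in the table above. `result_count` is the estimated
    number of search-engine hits for the username."""
    if "@" in username:
        return "high"      # email address: immediately identifying
    if result_count == 0:
        return "low"       # no footprint elsewhere on the web
    if result_count < 100:
        return "high"      # small result set, easy to prune by hand
    if result_count < 1000:
        return "medium"
    return "low"           # too common to pin down from count alone

risk_level("quirkyhandle", 40)        # -> "high"
risk_level("jdoe@example.com", 0)     # -> "high"
```

The real-name cases in the last two rows require a separate heuristic (matching against name lists), which is why they overlap with the count-based categories.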
Location and work-related fields
The combination of hometown, current location, employer, previous employer and education (i.e., college) should be uniquely identifying for modern Americans, considering how mobile we are (unless you live in a rural town and have never left). In fact, any 3 or 4 of those fields will probably do. As a sanity check, I looked for duplicates on these fields within the database itself.
Amusingly, there were around 40 duplicates and even a few triplicates, but all of these turned out to be people re-listing their information in the hope of increasing their chances of getting funded. Since the dataset consists of only approved loans, all of these people were approved multiple times! This is a great example of how k-anonymity breaks down in a natural way. [k-anonymity is an intuitively appealing but fundamentally flawed approach to protecting privacy that tries to make each record indistinguishable from a few other records. Here is a technical paper showing that k-anonymity and related methods such as l-diversity are useless. This is again something that deserves its own post, and so I won't belabor the point.]
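The duplicate check amounts to grouping records by their quasi-identifier tuple. A minimal sketch (the field names are illustrative, not the dataset’s actual column names):

```python
from collections import Counter

QUASI_IDENTIFIERS = ("hometown", "current_location", "employer",
                     "previous_employer", "education")

def find_duplicates(records, fields=QUASI_IDENTIFIERS):
    """Return each quasi-identifier tuple that appears more than once,
    along with how many records share it."""
    counts = Counter(tuple(r.get(f) for f in fields) for r in records)
    return {key: n for key, n in counts.items() if n > 1}
```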
While I’m sure that auxiliary information exists to de-anonymize people based on these fields, I’m not sure of the easiest way to get it, considering that it needs to be put together from a variety of different sources. Companies such as ChoicePoint probably have this data in one place already, but you need a name or social security number to search. Instead, screen-scraping social network sites would be a good way to start aggregating this information. Once the auxiliary information is available, the re-identification process is algorithmically trivial.
The “Associations” field
I love this field, since it is very similar to the high-dimensional data in the Netflix paper. Since Lendingclub was launched as a Facebook application, it appears that they asked for everyone’s Facebook groups. Anyone familiar with de-anonymizing high-dimensional data knows that you only need 3-4 items to uniquely identify a person. It gets worse: the Facebook API allows you to retrieve users’ names and affiliations by searching on group membership. You can use the affiliations field (a list of networks you belong to, distinct from the group memberships) to narrow things down once you get to a few tens or even hundreds of candidate users. This gives you a person’s identity in the most concrete manner possible: a Facebook id, name and picture.
How many users are vulnerable? Based on manually analyzing a small sample, it appears that (roughly) anyone with three or more groups listed is vulnerable: around 300 users. (Users with two listed groups may be vulnerable if both groups are unpopular, and users with many groups may not be vulnerable if all of them are popular, but let’s ignore that.)
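To see why three or four groups suffice, here is a naive back-of-the-envelope model: if group memberships were independent, the expected number of people sharing all of them would shrink multiplicatively with each group. (The population and group sizes in the usage note below are made-up illustrative numbers; real memberships are correlated, so this overstates uniqueness somewhat, but the qualitative point stands.)

```python
def expected_candidates(population, group_sizes):
    """Expected number of users belonging to *all* the listed groups,
    under a (naive) assumption that memberships are independent."""
    expected = population
    for size in group_sizes:
        expected *= size / population   # fraction of users in this group
    return expected
```

With ~300 million users and three groups of 10,000, 5,000 and 2,000 members, the expectation is far below one, i.e., the combination is almost certainly unique.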
Now, automating the de-anonymization is hard, for two reasons. First, the group names are presented as free-form text, and the field separator (a comma) that delimits group names within a cell also appears inside group names themselves! Second, the Facebook API doesn’t allow you to search by group name.
I managed to overcome both of these limitations. I wrote a script that evaluates the context around a comma and determines whether it occurs at the boundary of a group name or in the middle of one. Mapping a group name to a Facebook group id is a much harder problem. One possible solution is to use a Google search and parse the “gid” parameter from the URL of matching search results. Example: “Addicted to Taco Bell site:facebook.com.” Various hacks can refine this, such as putting the group name in quotes or using Google’s “allinurl:” to match the pattern of Facebook group page URLs.
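For the Google-search route, pulling the gid out of a matching result URL is the easy part; a sketch, assuming the old-style Facebook group URL format with a gid query parameter:

```python
from urllib.parse import urlparse, parse_qs

def extract_gid(url):
    """Return the 'gid' query parameter of a Facebook group URL,
    or None if it isn't present."""
    params = parse_qs(urlparse(url).query)
    values = params.get("gid")
    return values[0] if values else None
```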
The other strategy, and the one that I pursued, is to use the search feature on Facebook itself. A higher percentage of searches succeed with this approach, but it is harder because I needed to parse the HTML that is returned. With either strategy, the hardest part is in distinguishing between multiple groups that often have almost identical names. My current strategy succeeds for about one-third of the groups, and maps the group name to either a single gid or a short list of candidate gids. I suspect that a combination of Google and Facebook searches would work best. Of course, using human intelligence would increase the yield considerably.
The final step is to get the group members via the Facebook Query Language (FQL), find the users common to all the listed groups, and use the affiliations to further prune the set. I’ve written the FQL query and verified that it works. Running it en masse is a little slow, however, since the query takes a long time to return. I’ll probably run it when I have some more free time to analyze the results.
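Once each group’s member list is in hand (via an FQL query against the group_member table), the last step is just a set intersection plus affiliation pruning. A sketch, with the data structures as my own illustrative choices:

```python
def candidate_members(member_sets, target_networks=None, user_networks=None):
    """Intersect the member sets of all listed groups, then optionally
    keep only users who share a network with the target's affiliations.

    member_sets:     list of sets of user ids, one set per group
    target_networks: set of network names from the Associations record
    user_networks:   dict mapping user id -> set of that user's networks
    """
    candidates = set.intersection(*member_sets)
    if target_networks and user_networks:
        candidates = {u for u in candidates
                      if target_networks & user_networks.get(u, set())}
    return candidates
```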
In general, what does an attacker need to carry out de-anonymization attacks of the sort described here? A little ingenuity in looking for auxiliary information is a must. Being able to write clients for different APIs, as well as screen-scraping code, is very helpful. Finally, there are a number of tasks involving a bit of “AI,” such as matching group names, for which there is no straightforward algorithm but where different heuristics can get you very close to an optimal solution.
Thanks to David Molnar for helping me figure out Facebook’s and Google’s APIs. Thanks to Vitaly Shmatikov and David Molnar for reading a draft of this essay.