Lendingclub.com: A De-anonymization Walkthrough

November 12, 2008 at 8:14 pm

The AOL and Netflix privacy incidents have shown that people responsible for data release at these companies do not put themselves in the potential attacker’s shoes in order to reason about privacy. The only rule that is ever applied is “remove personally identifiable information,” which has been repeatedly shown not to work. This fallacy deserves a post of its own, and so I will leave it at that for now.

The reality is that there is no way to guarantee privacy of published customer data without going through complex, data-driven reasoning. So let me give you an attacker’s-eye-view account of a de-anonymization I carried out last week—perhaps an understanding of the adversarial process will help reason about the privacy risks of data release.

Lending Club, a company specializing in peer-to-peer loans, makes the financial information collected from their customers (borrowers) publicly available. I learned of this a week ago, and there are around 4,600 users in the dataset as of now. This could be a textbook example illustrating a variety of types of sensitive information and a variety of attack methods to identify the individuals! Each record contains the following information:

I.   Screen name
II.  Loan Title, Loan Description
III. Location, Hometown, Home Ownership, Current Employer, Previous Employers, Education, Associations
IV.  Amount Requested, Interest Rate, APR, Loan Length, Amount Funded, Number of Lenders, Expiration Date, Status, Application Date
V.   Credit Rating, Tenure, Monthly Income, Debt-To-Income Ratio, FICO Range, Earliest Credit Line, Open Credit Lines, Total Credit Lines, Revolving Credit Balance, Revolving Line Utilization, Inquiries in the Last 6 Months, Accounts Now Delinquent, Delinquent Amount, Delinquencies (Last 2 yrs), Months Since Last Delinquency, Public Records On File, Months Since Last Record

What data is sensitive?

Of course, any of the above fields might be considered sensitive by one or another user, but there are two types of data that are of particular concern: financial data and the loan description. The financial data includes monthly income, credit rating and FICO credit score; enough said. Loan description is an interesting column. A few users just put in “student loans” or “consolidate credit card debt.” However, a more informative description is the norm, such as this one:

This loan will be used to pay off my 19% Business Credit Card with AMEX.   I have supporting documentation to prove my personal Income. I would much rater get a loan and pay back fixed amount each month rather then being charged more and more each month on the same balance.   I can afford to pay at min $800 a month. I have 4 Reserves in the bank and have over 70% of my credit limit open for use.

Often, users reveal a lot about their personal life in the hope of appealing to the emotions of the prospective lender. Here’s an example (this is fairly common in the data):

My husband’s lawyer has told us that we need $5000 up front to pay for his child custody case. We are going to file for primary custody. Right now he has no visitation rights according to their divorce agreement. His ex-wife has been evicted twice in the four months and is living with 2 of their 3 daughters in a two bedroom apartment with her boyfriend. She has no job or car and the only money they have is what we give them in child support and she blows all of it on junk. We have a 2000+ square foot house, both have stable jobs, and our own cars. Both girls(12 and 15 years old) are allowed to go and do whatever they please even though they are failing classes at school. We are clearly the better situation for them to be raised in but we simply do not have that much money all at once. We would be able to pay around $200 per month for repayment.

A few loan descriptions are quite hilarious.  This one is my personal favorite.

Who’s the “bad guy” and what might they do with data of this kind, assuming it can be re-identified with the individuals in question? Certainly, it would help shady characters carry out identity theft. But there is also the unpleasant possibility that a customer’s family members or a boss might learn something about them that the customer didn’t intend them to know. The techniques below focus on the former threat model, en masse de-anonymization. The latter is even easier to carry out since human intelligence can be applied.

How to de-anonymize

The “screen name” field

Releasing the screen name seems totally unnecessary. Many people use a unique username everywhere (in fact, this tendency is so strong that there is a website to automate the process of testing your favorite username across websites). Often, googling a username brings up a profile page on other websites. These results can then be pruned in an automated way by looking at the profile information on the page. Here is an example (mjchrissy) taken from the Lending Club dataset. By observing that the person in the MySpace profile is in the same geographical location (NJ) as the person in the dataset, we can be reasonably sure that it’s the same person.

To measure the overall vulnerability, I wrote a script to find the estimated Google results count for each username in the dataset, using Google’s search API. If there are fewer than 100 results, I consider the person to be highly vulnerable to this attack; if there are between 100 and 1,000, they are moderately vulnerable. The Google count is only an approximate measure. For example, the estimated count for my standard username (randomwalker) is in the tens of thousands, but most of the results in the first few pages relate to me, and again, this can be confirmed by parsing the profile pages that are found by the search. Also, the query can be made more specific by using auxiliary terms such as “user” and “profile.” For example, the username radiothermal, also from the dataset, appears to be a normal word with tens of thousands of hits, but with the word “profile” thrown in, we get their identity right away.
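The thresholding described above is simple enough to sketch. This is an illustration rather than the original script: the function name and the sample counts are hypothetical, and the treatment of a zero count as low risk follows the summary table later in the post.

```python
def username_risk(result_count):
    """Map an estimated search-result count to a risk level."""
    if result_count < 0:
        raise ValueError("result count cannot be negative")
    if result_count == 0:
        return "low"     # the post's table rates zero results as low risk
    if result_count < 100:
        return "high"    # few hits: the username is likely unique to one person
    if result_count < 1000:
        return "medium"
    return "low"         # many hits: the raw count alone says little


# Hypothetical counts for illustration; real counts would come from a search API.
sample_counts = {"user_a": 0, "user_b": 37, "user_c": 450, "user_d": 25000}
risk_by_user = {name: username_risk(n) for name, n in sample_counts.items()}
```

As the randomwalker and radiothermal examples show, a "low" label from the raw count is only a first pass; refining the query can still expose the user.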

Some users choose their email address as their username. This immediately compromises their identity even if there are no Google search results for it. Finally, there are users who use their real name as their screen name. This is harder to measure, but we can get a lower bound with a clever enough script. (You can find my script here; I’m quite proud of it :-)) The table below summarizes the different types and levels of risk. Note that some of the categories are overlapping; the total number of high-risk records is 1725 and the total number of medium-risk records is 939.

Risk type                      Risk level   No. of users
result count = 0               low          1198
0 < result count < 100         high         1610
100 <= result count < 1000     medium       560
1000 <= result count           low          1196
username is email              high         51
either first or last name      medium       429
both first and last name       high         204
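The email and real-name checks behind the last three rows of the table can be sketched as follows. This is not the script linked above, which is not reproduced in the post. The name lists are tiny placeholders (an assumption); a real run would load large first-name and surname lists. Matching whole tokens, plus run-together names such as "marysmith", errs toward missing names rather than inventing them, in the spirit of a lower bound.

```python
import re

# Deliberately loose pattern: enough to flag screen names that look like email addresses.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Placeholder name lists (assumption); substitute real first-name and surname lists.
FIRST_NAMES = {"john", "mary", "chris"}
LAST_NAMES = {"smith", "jones", "lee"}


def name_parts(screen_name):
    """Return (has_first, has_last) based on alphabetic tokens in the screen name."""
    tokens = re.findall(r"[a-z]+", screen_name.lower())
    # Catch run-together names: some prefix is a first name and the rest a surname.
    for token in tokens:
        for i in range(1, len(token)):
            if token[:i] in FIRST_NAMES and token[i:] in LAST_NAMES:
                return True, True
    has_first = any(t in FIRST_NAMES for t in tokens)
    has_last = any(t in LAST_NAMES for t in tokens)
    return has_first, has_last


def screen_name_risk(screen_name):
    """Return 'high', 'medium', or None, mirroring the last three table rows."""
    if EMAIL_RE.match(screen_name):
        return "high"
    has_first, has_last = name_parts(screen_name)
    if has_first and has_last:
        return "high"
    if has_first or has_last:
        return "medium"
    return None
```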


Location and work-related fields

The combination of hometown, current location, employer, previous employer and education (i.e., college) should be uniquely identifying for modern Americans, considering how mobile we are (except if you live in a rural town and have never left there). In fact, any 3 or 4 of those fields will probably do. As a sanity check, I looked for duplicates on these fields within the database itself.

Amusingly, there were around 40 duplicates and even a few triplicates, but all of these turned out to be people re-listing their information in the hope of increasing their chances of getting funded. Since the dataset consists of only approved loans, all of these people were approved multiple times! This is a great example of how k-anonymity breaks down in a natural way. [k-anonymity is an intuitively appealing but fundamentally flawed approach to protecting privacy that tries to make each record indistinguishable from a few other records. Here is a technical paper showing that k-anonymity and related methods such as l-diversity are useless. This is again something that deserves its own post, and so I won’t belabor the point.]
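A duplicate check of this kind amounts to grouping records by the tuple of identifying fields. The sketch below assumes each record is a dictionary; the field names and the `screen_name` key are illustrative stand-ins for the dataset's actual column names.

```python
from collections import defaultdict

# Assumed column names; the real dataset may label these differently.
QUASI_IDENTIFIERS = ("hometown", "location", "employer", "previous_employer", "education")


def find_duplicates(records, fields=QUASI_IDENTIFIERS):
    """Group records that share the same values on all the given fields."""
    groups = defaultdict(list)
    for record in records:
        # Normalize case and surrounding whitespace so trivial variations still collide.
        key = tuple((record.get(f) or "").strip().lower() for f in fields)
        groups[key].append(record["screen_name"])
    # Keep only keys shared by more than one record.
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Any key that maps to two or more screen names is a duplicate (or triplicate) worth inspecting by hand, which is how re-listings by the same person would show up.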

While I’m sure that auxiliary information exists to de-anonymize people based on these fields, I’m not sure what the easiest way to get it is, considering that it needs to be put together from a variety of different sources. Companies such as Choicepoint probably have this data in one place already, but you need a name or social security number to search. Instead, screen-scraping social network sites would be a good way to start aggregating this information. Once auxiliary information is available, the re-identification process is trivial algorithmically.
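To make "trivial algorithmically" concrete, here is a minimal linkage sketch. It assumes the auxiliary data has already been gathered as a list of profiles carrying a real name plus the same fields as the dataset; all field names, the `name` key, and the `min_matches` default of 3 (taken from the "any 3 or 4 of those fields" estimate above) are assumptions.

```python
def normalize(value):
    """Lowercase and collapse whitespace; treat missing values as empty."""
    return " ".join((value or "").lower().split())


def matching_fields(record, profile, fields):
    """Count fields that are non-empty in the record and equal in both."""
    count = 0
    for field in fields:
        value = normalize(record.get(field))
        if value and value == normalize(profile.get(field)):
            count += 1
    return count


def link(records, profiles, fields, min_matches=3):
    """Map a screen name to a real name when exactly one profile matches."""
    links = {}
    for record in records:
        candidates = [p for p in profiles
                      if matching_fields(record, p, fields) >= min_matches]
        if len(candidates) == 1:  # ambiguous or empty matches are left unlinked
            links[record["screen_name"]] = candidates[0]["name"]
    return links
```

The nested loop is quadratic, which is fine for a few thousand records; a larger run would index the profiles by field value first.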

The “Associations” field

I love this field, since it is very similar to the high-dimensional data in the Netflix paper. Since Lending Club was launched as a Facebook application, it appears that they are asking for everyone’s Facebook groups. Anyone who is familiar with de-anonymizing high-dimensional data would know that you only need 3-4 items to uniquely identify a person. It gets worse: the Facebook API allows you to get users’ names and affiliations by searching for group membership. You can use the affiliations field (which is a list of networks you belong to, and is distinct from the group memberships) to narrow things down once you get to a few tens or even hundreds of candidate users. This gives you a person’s identity in the most concrete manner possible: a Facebook id, name and picture.

How many users are vulnerable? Based on manually analyzing a small sample of users, it appears that (roughly) anyone with three or more groups listed is vulnerable, so around 300. (Users with two listed groups may be vulnerable if they are both not very popular, and users with many groups may not be vulnerable if they are all popular, but let’s ignore that.)


Now, automating the de-anonymization is hard, for two reasons. First, the group names are presented as free-form text, and the field separator (comma) that separates different group names in the same cell appears in the names of groups as well! Second, the Facebook API doesn’t allow you to search by group name.

I managed to overcome both of these limitations. I wrote a script that evaluates the context around a comma and determines if it occurs at the boundary of a group name or in the middle of it. Mapping a group name to a Facebook group id is a much harder problem. One possible solution is to use a Google search, and parse the “gid” parameter from the URL of matching search results. Example: “Addicted to Taco Bell site:facebook.com.” There are various hacks that can be used to refine it, such as putting the group name in quotes or using Google’s “allinurl:” to match the pattern of the Facebook group page URLs.
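Both steps in this paragraph can be sketched briefly. The comma heuristic below is an assumption, not the rule the original script used: it treats a comma as internal to a group name when the text after it starts with a lowercase letter or with a word from a small, hypothetical continuation list. The gid extraction assumes result URLs that carry the group id as a `gid` query parameter, as described above.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical list of words that suggest a comma sits inside a group name.
CONTINUATION_WORDS = {"and", "or", "but", "inc.", "jr."}


def split_group_names(cell):
    """Split a comma-separated cell, re-joining pieces that look like continuations."""
    groups = []
    for piece in cell.split(","):
        text = piece.strip()
        if not text:
            continue
        first_word = text.split()[0].lower()
        looks_like_continuation = text[0].islower() or first_word in CONTINUATION_WORDS
        if groups and looks_like_continuation:
            groups[-1] += ", " + text   # comma was inside the previous name
        else:
            groups.append(text)         # comma was a real separator
    return groups


def extract_gid(url):
    """Return the numeric 'gid' query parameter of a URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("gid", [])
    if values and values[0].isdigit():
        return values[0]
    return None
```

A capitalization-based rule will misfire on names with a capitalized word after an internal comma, which is one reason human review raises the yield.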

The other strategy, and the one that I pursued, is to use the search feature on Facebook itself. A higher percentage of searches succeed with this approach, but it is harder because I needed to parse the HTML that is returned. With either strategy, the hardest part is in distinguishing between multiple groups that often have almost identical names. My current strategy succeeds for about one-third of the groups, and maps the group name to either a single gid or a short list of candidate gids. I suspect that a combination of Google and Facebook searches would work best. Of course, using human intelligence would increase the yield considerably.
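One way to reduce a list of search results to "a single gid or a short list of candidate gids" is fuzzy string matching. The sketch below uses Python's standard `difflib`; it is not the matching strategy described above, whose details are not given. The input format (pairs of gid and group name) and the 0.9 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def candidate_gids(listed_name, search_results, threshold=0.9):
    """Return one gid on a unique exact match, else gids of close matches, best first.

    search_results is a list of (gid, group_name) pairs, e.g. parsed from a
    search results page (hypothetical input format).
    """
    target = normalize(listed_name)
    exact = [gid for gid, name in search_results if normalize(name) == target]
    if len(exact) == 1:
        return exact
    scored = [(SequenceMatcher(None, target, normalize(name)).ratio(), gid)
              for gid, name in search_results]
    return [gid for score, gid in sorted(scored, reverse=True) if score >= threshold]
```

When several groups share an identical name, the function returns all of them, leaving the final choice to later pruning or to a human.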

The final step here is to get the group members via the Facebook Query Language (FQL), find the users who are common to all the listed groups, and use the affiliations to further prune the set of users. I’ve written the FQL query and verified that it works. Running it en masse is a little slow, however, since the query takes a long time to return. I’ll probably run it when I have some more free time to analyze the results.
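Once the member lists are in hand, the intersection and pruning are a few lines of set arithmetic. The sketch below leaves out the FQL query itself and assumes its results have been collected into plain dictionaries; the data structures and function names are hypothetical.

```python
def common_members(members_by_group, listed_gids):
    """Return users who belong to every listed group that could be resolved."""
    # Groups with no fetched member list are skipped (an assumption of this sketch).
    member_sets = [set(members_by_group[g]) for g in listed_gids if g in members_by_group]
    if not member_sets:
        return set()
    return set.intersection(*member_sets)


def prune_by_affiliation(candidates, affiliations_by_user, expected):
    """Keep candidates with at least one affiliation containing the expected text."""
    expected = expected.lower()
    return {user for user in candidates
            if any(expected in a.lower() for a in affiliations_by_user.get(user, ()))}
```

Here `expected` would come from the borrower's record, for example the location or college, so that a candidate set of tens or hundreds of users shrinks to one or a handful.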

Let’s summarize

The interesting thing about this dataset is that Lending Club makes it very clear in their privacy policy that they publish the data in this fashion. And yet, it seems that intuitively, this is an egregious violation of privacy, no matter what the privacy policy might say. I will have more to say on this soon.

Almost everyone in the dataset can be re-identified if their location and work information is known, although this information is a little hard to gather on a large scale. The majority of customers are vulnerable to some extent because of identifying usernames, and more than a third are highly vulnerable. The privacy policy does state that the username will be shared in conjunction with other information, but can users really be expected to be aware of how easy it is to automate re-identification via their username? More importantly, why publish the username? What were they thinking? And certainly, the possibility of re-identification via their group associations must come as a complete surprise to most customers.

In general, what does an attacker need to carry out de-anonymization attacks of the sort described here? A little ingenuity in looking for auxiliary information is a must. Being able to write clients for different APIs, as well as screen-scraping code, is very helpful. Finally, there are a number of tasks involving a little bit of “AI,” such as matching group names, for which there is no straightforward algorithm but where using different heuristics can get you very close to an optimal solution.

Thanks to David Molnar for helping me figure out Facebook’s and Google’s APIs. Thanks to Vitaly Shmatikov and David Molnar for reading a draft of this essay.


8 Comments

  • 1. Fred  |  November 27, 2008 at 4:23 am

    Another interesting detail is that Lending Club’s “Member map” on their home page seems to place members by ZIP, rather than by city and state (which is on the listings). More fodder for baddies trying to root out users’ identities.

  • 2. Arvind  |  November 27, 2008 at 4:29 am

    Fred: very interesting. I wonder if there is a way to extract the zip codes automatically. I’m guessing it should be possible with a bit of javascript-fu.

  • […] remind us that websites that collect and republish seemingly innocuous facts about their users are often vulnerable to data mining. It doesn’t matter whether you keep the users’ names and addresses secret — the […]

  • 4. fsu.edu | Artech Blog  |  March 30, 2009 at 3:19 pm

    […] remind us that websites that collect and republish seemingly innocuous facts about their users are often vulnerable to data mining. It doesn’t matter whether you keep the users’ names and addresses secret — the facts […]

  • […] have probably seen this post about how Lending Club’s practice of publishing certain borrower information makes the […]

  • 6. Gary in Clearwateer  |  April 16, 2009 at 7:16 pm

    Conceptually it makes sense, but as most of the posters have already acknowledged, I would think that the cumulative distribution of information would be a significant target for data mining. And they identify particularly ‘sensitive’ areas as financial data and loan description, but what about other types of data that they may consider ‘not-so-sensitive’ but which could ultimately compromise someone’s identity when collected longitudinally?

  • 7. CA  |  September 14, 2009 at 2:45 am

    Extracting zip codes is rather easy; the data is available through the Census Bureau. It’s like stealing candy from a baby :P. I had to do it for the mailers for my new not-for-profit website, in just California.

  • […] Group affiliations, just like your movie-watching history and many other types of attributes, are sufficient to fingerprint a user. There’s a high chance there’s no one else who belongs to the same set of groups that you do (or is even close). [Aside: I used this fact to show that Lending Club data can be de-anonymized.] […]



About 33bits.org

I'm an assistant professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
