Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils have a neat paper “How Unique and Traceable are Usernames?“ that addresses the following question:
Suppose you find the same username on different online services, what is the probability that these usernames refer to the same physical person?
The background for this investigation is that there is tremendous commercial value in linking together every piece of online information about an individual. While the academic study of constructing “uber-profiles” by linking social profiles is new (see Large Online Social Footprints—An Emerging Threat for another example), commercial firms have long been scraping profiles, aggregating them, and selling them on the grey market. Well-known public-facing aggregators such as Spokeo mainly use public records, but online profiles are quickly becoming part of the game.
Paul Ohm has even talked of a “database of ruin.” No matter what moral view one takes of this aggregation, the technical questions are fascinating.
The research on Record Linkage could fill an encyclopedia (see here for a survey) but most of it studies traditional data types such as names and addresses. This paper is thus a nice complement.
Usernames are particularly useful for carrying out linkage across different sites for two reasons:
- They are almost always available, especially on systems with pseudonymous accounts.
- When comparing two databases of profiles, usernames are a good way to quickly find candidate matches before exploring other attributes.
The mathematical heavy-lifting that the authors do is described by the following:
… we devise an analytical model to estimate the uniqueness of a user name, which can in turn be used to assign a probability that a single username, from two different online services, refers to the same user
we extend this model to cases when usernames are different across many online services … experimental data shows that users tend to choose closely related usernames on different services.
For example, my Google handle is ‘randomwalker’ and my twitter username is ‘random_walker’. Perito et al’s model can calculate how obscure the username ‘random_walker’ is, as well as how likely it is that ‘random_walker’ is a mutation of ‘randomwalker’, and come up with a combined score representing the probability that the two accounts refer to the same person. Impressive.
The authors also present experimental results. For example, they find that with a sample of 20,000 usernames drawn from a real dataset, their algorithms can find the right match about 60% of the time with a negligible error rate (i.e., 40% of the time it doesn’t produce a match, but it almost never errs.) That said, I find the main strength of the paper to be in the techniques more than the numbers.
Their models know all about the underlying natural language patterns, such as the fact that ‘random_walker’ is more meaningful than say ‘rand_omwalker’. This is achieved using what are called Markov models. I really like this class of techniques; I used Markov models many years ago in my paper on password cracking with Vitaly Shmatikov to model how people pick passwords.
The setting studied by Perito et al. is when two or more offline databases of usernames are available. Another question worth considering is determining the identity of a person behind a username via automated web searches. See my post on de-anonymizing Lending Club data for an empirical analysis of this.
There is a lot to be said about the psychology behind username choice. Ben Gross’s dissertation is a fascinating look at the choice of identifiers for self-representation. I myself am very attached to ‘randomwalker’; I’m not sure why that is.
A philosophical question related to this research is whether it is better to pick a unique username or a common one. The good thing about a unique username is that you stand out from the crowd. The bad thing about a unique username is that you stand out from the crowd. The question gets even more interesting (and consequential) if you’re balancing Googlability and anonymity in the context of naming your child, but that’s a topic for another day.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
February 16, 2011 at 5:19 pm
It is natural for developers building web applications to operate under a public/private dichotomy, the assumption being that if a user made a piece of data public, then they’ve given up any privacy expectation. But as we saw in a previous article, users often expect more subtle distinctions, and many unfortunate privacy blunders have resulted. To avoid repeats of these, engineers need to be able to reason about the privacy implications of specific technical features. This article presents a set of criteria for doing so.
Computers are designed to keep data around forever unless explicitly deleted. But this assumption makes many nontechnical people deeply uncomfortable. There have been a number of proposals to “make the Internet forget,” bringing it in line with humans’ anthropomorphic expectations. While nothing much will probably result from these broad proposals, there need to be some controls on archiving, especially by third parties. Here are three examples that illustrate why this is important:
- A woman was fired from her job recently because of her employer found some of her online revelations objectionable. She got caught because Topsy, a Twitter search engine, retained her personal data in its cache even after she had deleted it from Twitter.
- Joe Bonneau revealed that the vulnerability of photo-sharing sites failing to delete photos from their CDN caches persists on many sites, a full year after it was first made public and received media attention.
The examples above show a clear need for a standard for machine-readable third-party data retention policy — a robots.txt on steroids, if you will. Pete Warden proposed expanding robots.txt a few months ago; now that multiple sites are facing this problem, perhaps there will be some momentum in this direction.
The real-time web relies on “pushing” updates to clients instead of the traditional model of crawling. The push model greatly improves timeliness and machine load, but the problem is that there is typically no way to delete or update existing items in real-time.
This fact bites me on a regular basis. When I make a blog post, Google reader gets hold of it immediately, but if I realize I wrote something stupid and update the post, it doesn’t show up for several hours because updates don’t propagate through the real-time mechanism.
Or consider tweets: if you tweet something inappropriate and delete it a second later, it might be too late: Twitter’s partners could have already gotten hold of it through the “firehose,” and it might already be displayed on a sidebar on some other site.
Google’s “undo send” feature is a great solution to this type of problem — it holds the message in a queue for a few seconds before sending it out. Every real-time system needs such a panic feature!
While making data searchable greatly increases its utility, it also dramatically increases the privacy risks. It is tempting to tell users to get used to the fact that everything they write is searchable, but that hasn’t been successful so far, as IRSeek found out when they tried to launch an IRC search engine. There are entire companies like ReputationDefender that help you clean up the web search results for your name.
The lack of searchability of your site can be a feature. This is obviously not true for the majority of sites, but it is worth keeping in mind. One major reason why LiveJournal has a “closed” feel — which is a big part of its appeal — is that posts don’t rank well in Google searches, if they are indexed at all. For example, Livejournal posts have a numeric ID instead of title words in the URL. Although it sounds like someone skipped SEO 101, it is actually by design.
By aggregate data I mean data from a single source or website, comprising all or a significant fraction of the users. The appeal of aggregate data for research is clear: not only are larger quantities better, aggregation avoids the bias problems of sampling. On the other hand, the privacy concerns are also clear: the fear is that the data will end up at the hands of the wrong people, such as one of the database marketing companies.
Aggregation is the most common of the privacy problems among the 7 examples I listed in my previous article. In some cases the original source made the data available and then backtracked, in other cases a third party crawled the data and got into trouble, and some were a mix of both.
For websites sitting on interesting data, an excellent compromise would be in-house data analysis (or perhaps a partnership program with outside researchers), as an alternative to making data public. OkCupid has been doing this extremely well, in my opinion — they have a great series of blog posts on race, looks and everything else that affects online dating. The man-hours spent on data analysis are well worth the increased pageviews and mindshare. Facebook has a data team as well, but given the quantity of data they have, they could be publishing quite a bit more.
By linkage I refer to connecting the same person across multiple websites. Confusingly, this is sometimes referred to as aggregation. Linkage can take the form of database marketers connecting different databases of personal information, or in the online context, it can take the form of tools that link together individual profiles on different websites.
Pervasive online identities are becoming the norm, which is something I’ve been writing about. All of your online activities are going to be easily linkable sooner or later unless you explicitly take steps to keep your identities separate. But again, users haven’t quite woken up to this yet. Unwanted linkage is therefore something that can upset users greatly. The auto-connect feature in Google Buzz is the best example. Opt-in rather than opt-out is probably the way to go, at least for a few years until everyone gets used to it.
Summary. While well-understood access control principles tell us how to implement the privacy of data marked private, the privacy of “public” data is just as big a concern. So far there has been no systematic way of analyzing exactly what it is that users object to. In this article I’ve presented five such features. To avoid nasty surprises, developers building websites need to think carefully about privacy and user behavior when implementing any of these features.
Thanks to Ann Kilzer for reviewing a draft.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
July 6, 2010 at 7:43 pm