Eccentricity Explained

October 3, 2008 at 2:38 am

When trying to find someone in an ‘anonymous’ collection of data, two major questions need to be answered:

  • Which is the best match among all the data records, given what I know about the person I’m looking for?
  • Is the match meaningful, and not an unrelated record that coincidentally happens to be similar?

The first question is conceptually simple: one needs to come up with a “scoring function” that compares two sets of data and produces a numerical match score. However, this needs to be done with domain-specific knowledge. In the case of the Netflix dataset, for instance, the scoring function incorporated our intuitions about how long a person might take to review a movie after watching it.
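
To make this concrete, here is a minimal sketch of what such a scoring function might look like. Everything in it is hypothetical: the function name match_score, the movie IDs, and the rating and date tolerances are made up for illustration, not the values from our actual Netflix analysis.

    from datetime import date

    # A hypothetical scoring function in this spirit. The rating and date
    # tolerances are illustrative values, not the ones from the actual
    # Netflix analysis.
    def match_score(aux, record, rating_tol=1, date_tol_days=14):
        """aux and record each map movie_id -> (rating, review_date)."""
        score = 0.0
        for movie, (aux_rating, aux_date) in aux.items():
            if movie not in record:
                continue
            rating, review_date = record[movie]
            if (abs(rating - aux_rating) <= rating_tol
                    and abs((review_date - aux_date).days) <= date_tol_days):
                score += 1.0  # rare movies could be weighted more heavily
        return score

    aux = {"movie_42": (5, date(2005, 3, 1))}
    record = {"movie_42": (4, date(2005, 3, 10)), "movie_7": (2, date(2005, 6, 1))}
    print(match_score(aux, record))  # 1.0: rating within 1, date within 14 days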

The second question is harder, but perhaps somewhat surprisingly, it can be answered in a domain-independent way. This is the notion of “eccentricity” that we came up with, and it might be of independent interest. During my talks I saw confusion and misunderstanding about it time and again; hence this post.

The concept behind eccentricity is to measure how much the matching record “stands out” from the crowd. One way to do that would be to measure the difference between the top score and the mean score, as a multiple of the standard deviation. (You always need to divide by the standard deviation to get a dimensionless quantity.)

The problem with this intuitive measure is that in a large enough collection of data, there will always be entries with a high matching score purely by chance. To model what scores you’d expect by chance, you would need to know everything about how people rate movies (or browse the web, or whatever the equivalent is in your dataset), including the correlations between ratings. And that’s clearly impossible.

The trick is to look at the difference between the best match and the second-best match, as a multiple of the standard deviation. If the scores follow an exponential distribution (which is what you’d expect here by pure chance, not a Gaussian), then the difference between the top two matches also follows an exponential distribution, with the same scale, no matter how many records there are. That’s a nifty little lemma.
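
Here is a minimal sketch of the measure (an illustration, not our actual implementation), along with a simulation that checks the lemma empirically; the sample sizes are arbitrary:

    import numpy as np

    # A minimal sketch of eccentricity: the gap between the best and
    # second-best match scores, in units of the standard deviation.
    def eccentricity(scores):
        s = np.sort(np.asarray(scores, dtype=float))
        return (s[-1] - s[-2]) / s.std()

    # Empirical check of the lemma: for i.i.d. Exp(1) scores, the gap
    # between the top two order statistics is again Exp(1), regardless
    # of how many records there are.
    rng = np.random.default_rng(0)
    scores = rng.exponential(scale=1.0, size=(20_000, 500))
    top_two = np.sort(scores, axis=1)[:, -2:]
    gaps = top_two[:, 1] - top_two[:, 0]
    print(gaps.mean(), gaps.std())  # both close to 1.0, as the lemma predicts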

So, if the best match is 10 standard deviations away from the second best, it argues very strongly against the “null hypothesis” that the match occurred by a fluke: under the exponential null, the probability of a gap that large is e^-10, about 1 in 22,000. Visually, the results of applying eccentricity are immediately apparent.

Perhaps the reason that eccentricity is at first counterintuitive is that it looks at only two items and throws away the rest of the numbers. But this is common in statistical estimation. Here’s a puzzle that might help: given n samples a1, a2, …, an drawn uniformly at random from the interval [0, x], where x is unknown, what is the best predictor of x?

–Select whitespace below for answer–

Answer: max(ai) * (n+1)/n.

In other words, throw away all but one of the samples! Counterintuitive?
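
A quick simulation bears this out, and also shows how the seemingly natural alternative, twice the sample mean, loses. The setup below (x = 10, n = 50) is arbitrary, chosen just for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x_true, n, trials = 10.0, 50, 100_000
    samples = rng.uniform(0.0, x_true, size=(trials, n))

    est_max = samples.max(axis=1) * (n + 1) / n   # the answer above
    est_mean = 2.0 * samples.mean(axis=1)         # the "obvious" estimator

    def rmse(est):
        return np.sqrt(np.mean((est - x_true) ** 2))

    print(rmse(est_max))   # ~0.20: error shrinks like 1/n
    print(rmse(est_mean))  # ~0.82: error shrinks like 1/sqrt(n)

The estimator based on the single largest sample beats the one that averages all n samples, by a factor of roughly sqrt(n).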



4 Comments

  • 1. countrycousin  |  October 3, 2008 at 1:23 pm

    “counterintuitive?”

    Yes. :-{

  • 2. Ian Thomas  |  October 3, 2008 at 5:34 pm

Arvind,

    Thanks for the comment on my blog, which led me here. And thanks for this explanation of eccentricity – it’s some time since I’ve exercised my brain in this way!

    Cheers,
    Ian Thomas

  • 3. Raghu M  |  October 7, 2008 at 8:11 pm

    The puzzle is kind of counterintuitive … my first guess was that twice the average would be the best (but this only gives about 1/sqrt(n) accuracy).

  • 4. Arvind  |  October 8, 2008 at 3:48 am

    It was certainly very counterintuitive to me when I first came across it, which was long ago. It was an ‘aha’ moment for me; that’s probably when I realized the power of statistical inference and all that. It might very well have subconsciously affected my career choice :-)

    Do you see a proof of optimality of this predictor, though? Why is it better than, say, something that looks at the two largest samples?

