Posts tagged ‘algorithm’
When trying to find someone in an ‘anonymous’ collection of data, two major questions need to be answered:
- Which is the best match among all the data records, given what I know about the person I’m looking for?
- Is the match meaningful, and not an unrelated record that co-incidentally happens to be similar?
The first question is conceptually simple: one needs to come up with a “scoring function” that compares two sets of data and produces a numerical match score. However, this needs to be done with domain-specific knowledge. In the case of the Netflix dataset, for instance, the scoring function incorporated our intuitions about how long a person might take to review a movie after watching it.
The second question is harder, put perhaps somewhat surprisingly, can be done in a domain-independent way. This is the notion of “eccentricity” that we came up with, and it might be of independent interest. During my talks, I could see that there was some confusion and misunderstanding time and again; hence this post.
The concept behind eccentricity is to measure how much the matching record “stands out” from the crowd. One way to do that would be do measure the difference between the top score and the mean score. (As a multiple of the standard deviation. You always need to divide by the standard deviation to derive a dimensionless quantity.)
The problem with this intuitive measure is that in a large enough collection of data, there will always be entries that have a high enough matching score, purely by chance. To be able to model what scores you’d expect by chance, you need to know everything about how people rate movies (or browse the web, or the equivalent in your dataset) and the correlations between them. And that’s clearly impossible.
The trick is to look at the difference between the best match and the second best match as a multiple of the standard deviation. If the scores are distributed according to an exponential distribution (that’s what you’d expect in this case by pure chance, not a Gaussian), then the difference between the top two matches also follows an exponential distribution. That’s a nifty little lemma.
So, if the best match is 10 standard deviations away from the second best, it argues very strongly against the “null hypothesis,” which is that the match occured by fluke. Visually, the results of applying eccentricity are immediately apparent:
Perhaps the reason that eccentricity is at first counterintuitive is that it looks at only two items and throws away the rest of the numbers. But this is common in statistical estimation. Here’s a puzzle that might help: given n samples a1, a2, … an from the interval [0, x], where x is unknown, what is the best predictor of x?
–Select whitespace below for answer–
Answer: max(ai) * (n+1)/n.
In other words, throw away all but one of the samples! Counterintuitive?