## Posts tagged ‘algorithm’

### Graph Isomorphism: Deceptively Hard

* This is the first in a series of loosely related posts that will lead up to an announcement of new research results.*

Asymptotic complexity is often very useful in succintly expressing how hard a problem is. Occasionally, though, it can obfuscate the picture. Graph isomorphism is a perfect example.

Many people mistakenly believe that graph isomorphism (GI) is hard — either NP-hard, or hard enough to be insoluble in practice for large problem sizes. Certainly the complexity of the best known algorithm — *exp(O(sqrt(n log n)))*, due to Babai and Luks — would appear to support this belief. However, the truth is that GI belongs to its own complexity class, not known to be NP-hard nor known to be soluble in polynomial time. More importantly, on real inputs, graph isomorphism is *ridiculously easy*. So easy that real-life solvers can plow through hundreds of thousands of instances per second.

Why is this the case? It turns out that on random graphs, GI is solvable in a very straightforward way. You only need to look at the local structure of each node — due to the randomness, every node has a distinctive neighborhood. Thus, the complexity of the worst case input and the average case input are on opposite ends of a spectrum. To see where “real” graphs fall on this spectrum, note that generative models of real graphs have a regular part and a random part. The regular part is captured in rules such as preferential attachment, but such rules only induce a probability distribution; there is still quite a bit of coin tossing needed to generate each edge. Intuitively, if there is “enough” randomness in the neighborhood of each node, then you only need to look at the local structure to tell different nodes apart. Thus, real-world graphs behave pretty much like random graphs.

It gets better: one of the uses of nauty, the leading graph isomorphism solver, is to *find hard instances* of graph isomorphism. The way this is done is by generating hundreds of candidate hard instances, solving them, and picking the ones that are the hardest to solve. (These are usually highly specialized graphs which are strongly regular and have fearful symmetry groups.) It would appear, then, that finding hard inputs for GI is GI-hard! I wonder if this property can be formally defined, and if there are other known problems for which this is true. This is a similar notion to Impagliazzo’s Heuristica, a world where NP != P, but for any problem in NP, and for inputs drawn from any samplable distribution (i.e, for any “real-world” input), there exists a polynomial time algorithm to decide it.

.

While I find this an interesting theory problem, and would love to hear opinions on it from theorists, my reason for posting this is to point out that with all the current evidence, graph isomorphism can be assumed to be solvable in randomized polynomial time for any input that you would actually encounter in reality.

### Eccentricity Explained

When trying to find someone in an ‘anonymous’ collection of data, two major questions need to be answered:

**Which is the best match**among all the data records, given what I know about the person I’m looking for?**Is the match meaningful**, and not an unrelated record that co-incidentally happens to be similar?

The first question is conceptually simple: one needs to come up with a “**scoring function**” that compares two sets of data and produces a numerical match score. However, this needs to be done with domain-specific knowledge. In the case of the Netflix dataset, for instance, the scoring function incorporated our intuitions about how long a person might take to review a movie after watching it.

The second question is harder, put perhaps somewhat surprisingly, can be done in a domain-independent way. This is the notion of “**eccentricity**” that we came up with, and it might be of independent interest. During my talks, I could see that there was some confusion and misunderstanding time and again; hence this post.

The concept behind eccentricity is to measure how much the matching record “stands out” from the crowd. One way to do that would be do measure the difference between the top score and the mean score. (As a multiple of the standard deviation. You always need to divide by the standard deviation to derive a dimensionless quantity.)

The problem with this intuitive measure is that in a large enough collection of data, there will always be entries that have a high enough matching score, purely by chance. To be able to model what scores you’d expect by chance, you need to know everything about how people rate movies (or browse the web, or the equivalent in your dataset) and the correlations between them. And that’s clearly impossible.

The trick is to look at the **difference between the best match and the second best match** as a multiple of the standard deviation. If the scores are distributed according to an exponential distribution (that’s what you’d expect in this case by pure chance, not a Gaussian), then the difference between the top two matches also follows an exponential distribution. That’s a nifty little lemma.

So, if the best match is 10 standard deviations away from the second best, it argues very strongly against the “null hypothesis,” which is that the match occured by fluke. Visually, the results of applying eccentricity are immediately apparent:

Perhaps the reason that eccentricity is at first counterintuitive is that it looks at only two items and throws away the rest of the numbers. But this is common in statistical estimation. Here’s a puzzle that might help: *given n samples a _{1}, a_{2}, … a_{n} from the interval [0, x], where x is unknown, what is the best predictor of x?*

–Select whitespace below for answer–

Answer: max(a_{i}) * (n+1)/n.

In other words, throw away all but one of the samples! Counterintuitive?

—