The Surprising Effectiveness of Prizes as Catalysts of Innovation

Although the strategic use of prizes to foster sci-tech innovation has a long history, it has exploded in the last two decades—35% annual growth on average, or doubling every 2.3 years.[1] Much has been said on the topic, but I have yet to see a clear answer to the core mystery:

Why do prizes work?

Specifically, why are they more effective than simply hiring people to do it? The question is more complex than it sounds, and a valid explanation must address the following:

  • Why shouldn’t government and industry research funding be switched over entirely to a prize-based model?
  • Why did the prize revolution happen in the last two decades, and not earlier?
  • How do prizes succeed in spite of the massive duplication of effort that you’d expect due to numerous contestants trying to solve the same problem?

Prizes exploit the productivity-reward imbalance

In many fields there is a huge disparity (an order of magnitude or more) between the productivity of the top performers and that of the median performers. The structure of the corporation, having co-evolved with the industrial revolution to harness workers for building railroads or making textiles, is fundamentally limited in its ability to reward employees in creative endeavors in proportion to their contribution, or even to measure that contribution. Academia is a little better due to the precedence of fame over monetary reward, but has its own problems.

Enter prizes. The winner-take-all structure gives individuals or small organizations of exceptional caliber a chance to earn prestige as well as cash that they don’t otherwise have a shot at.

Given that the best innovators are more likely to feel that an academic or corporate job under-rewards them, self-selected prize contestants are likely to skew toward high-performers.

Prizes channel existing research funding

The Netflix prize attracted 34,000 contestants. At an average of just 1 hour (valued at $100) per contestant, the monetary value of the time spent on the contest dwarfs the prize amount. And the majority of contestants—or at least the ones with a serious chance—were already employed as researchers. This effect is broadly true: for example, contestants spent a total of over $100 million in pursuit of the Ansari X Prize which carries a $10 million award.
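As a rough check on the arithmetic above, here is a small sketch. The contestant count, the one-hour and $100 figures, and the X Prize numbers come from the paragraph. The $1 million Netflix grand prize is not stated in this post; it is filled in from the publicly announced amount, so treat it as an added assumption.

```python
# Effort-versus-prize arithmetic for the two contests mentioned above.

# Netflix prize: figures from the paragraph.
contestants = 34_000
hours_per_contestant = 1      # deliberately conservative average
value_per_hour = 100          # dollars

# Assumption: the publicly announced grand prize, not stated in the post.
netflix_prize = 1_000_000

netflix_effort_value = contestants * hours_per_contestant * value_per_hour
netflix_ratio = netflix_effort_value / netflix_prize

# Ansari X Prize: figures from the paragraph.
xprize_spending = 100_000_000
xprize_award = 10_000_000
xprize_ratio = xprize_spending / xprize_award

print(f"Netflix: ${netflix_effort_value:,} of effort, {netflix_ratio:.1f}x the prize")
print(f"Ansari X Prize: {xprize_ratio:.0f}x the award")
```

Even under the one-hour lower bound, contestant time is worth several times the purse, and for the Ansari X Prize the spending is ten times the award.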

The real funding for prize-winning efforts comes from Government grants and corporate research labs. The prize itself serves mainly to legitimize the task as a research goal.

This is in no way meant to be a criticism of prizes—sure, prizes direct attention away from other problems, but one expects that on average, problems for which prizes are offered are more important than others.

Nor does the ability of prizes to spur effort far in excess of the monetary award necessarily mean that contestant behavior is irrational, since the prestige and media attention are typically worth far more than the cash, and because failure to win the prize doesn’t mean the effort is wasted.

That said, the well-known human tendency to systematically overestimate one’s own abilities certainly has a role in explaining the power of prizes to attract talent. According to the McKinsey report cited in note [1], “many of the participants that we interviewed were absolutely convinced they were going to win [the Ansari X Prize], if not this year, then surely the next.”

What about democratization?

The openness of prizes is often advanced as a key reason for their superiority over traditional research funding. There are two very different components to this assertion: the first is that prizes encourage hybridization of expertise from different fields, given that researchers often fall into the trap of collaborating only within their own communities. There is evidence for this from a study of Innocentive.

The second argument is that prizes allow even non-expert members of the general public, who might otherwise never be involved in research, to participate. I find this argument unconvincing and there is little evidence to support it, if you ignore anecdotes from the 19th century when science funding was meager by today’s standards. However, crowdsourcing to the public seems a good strategy for prizes that are more about problem solving than original research. Challenge.gov may be a good example, depending on how it pans out.

The Internet as an enabler

Now let’s look at the three auxiliary questions I posed above. My explanation for prize effectiveness—self-selection, redirection of funding, and interdisciplinary collaboration—can answer them comfortably. If all research funding were based on prizes, it would defeat the purpose since prizes only serve to redirect existing research funding.

The rapid growth of the sector since 1990 is an obvious indication that the Internet had something to do with it. But how exactly? I think there are several reasons. First, the Internet could be making it easier for experts from different physical locations and/or areas of expertise to team up and to collaborate.

Second, increased reach, shorter cycles and improved economies of scale in most markets in the Internet era have exacerbated the performance-reward imbalance, as well as making the imbalance more obvious to all involved. This is a factor fueling the startup revolution as well.

Finally, and perhaps crucially, I believe the Internet has largely nullified one of the key disadvantages of prizes, which is duplication of effort. The Netflix prize, for one, was marked by a remarkable degree of sharing, and sponsors of new contests are increasingly tweaking the process to ensure that teams build on each other’s ideas.

These factors are only going to accelerate in the future, which suggests that the torrid growth of prizes in number and amount is going to continue for some time to come. There are now many companies dedicated to running these contests—Innocentive is the leader, and Kaggle is a startup focused on the data-mining space. Exciting times.

[1] My numbers are based on this McKinsey report which seems by far the most comprehensive study of prizes and is well worth reading for anyone interested in the subject. The aggregate purse of prizes over $100,000 grew from $50MM to $302MM from 1991 to 2008, during which period the share of “inducement prizes,” the kind we’re concerned with here, showed remarkable growth from 3% of the total to 78%.
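The headline growth figures can be roughly reproduced from the numbers in this note. The sketch below assumes that the 35% figure refers to the inducement-prize portion of the purse (total purse times inducement share) and that growth compounded evenly over the 17 years from 1991 to 2008. Both are a reading of the note, not something the report is quoted as saying here.

```python
import math


def annual_growth_rate(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1


def doubling_time(rate):
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + rate)


years = 2008 - 1991                      # 17 years

total_start, total_end = 50e6, 302e6     # aggregate purse, from the note
share_start, share_end = 0.03, 0.78      # inducement-prize share, from the note

# Assumption: inducement purse = total purse * inducement share.
inducement_start = total_start * share_start
inducement_end = total_end * share_end

total_rate = annual_growth_rate(total_start, total_end, years)
inducement_rate = annual_growth_rate(inducement_start, inducement_end, years)

print(f"Total purse growth:      {total_rate:.1%} per year")
print(f"Inducement purse growth: {inducement_rate:.1%} per year")
print(f"Doubling time:           {doubling_time(inducement_rate):.1f} years")
```

Under these assumptions the inducement purse grows at roughly 35% a year and doubles about every 2.3 years, while the aggregate purse grows far more slowly.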

Thanks to Steve Lohr for pointers to research when he interviewed me for his NYTimes Bits piece, and to @dan_munz and other Twitter followers for useful suggestions.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

June 6, 2011 at 3:03 pm

Price Discrimination is All Around You

This is the first in a series of articles that will show how we’re at a turning point in the history of price discrimination and discuss the consequences. This article presents numerous examples of traditional price discrimination that you see today, many of which are funny, sad, or downright devious.

Price discrimination, more euphemistically known as differential pricing and dynamic pricing, exploits the fact that in any transaction each customer has a different “willingness to pay.”

What is “willingness to pay,” and how does the seller determine it? To illustrate, let me quote a hilarious story by Steve Blank on selling enterprise software. The protagonist is one Sandy Kurtzig.

Sandy Kurtzig

Since it was the first non-IBM enterprise software on IBM mainframes, [when] she got her first potential order, she didn’t know how to price it. It must have been back in the mid-’70s. She’s [with] this buyer, has a P.O. on his desk, negotiating pricing with Sandy.

So, Sandy said she goes into the buyer who says, “How much is it?”

And Sandy gulped and picked the biggest number she thought anybody would ever rationally pay. And said, “$75,000”. And she said all the buyer did was write down $75,000.

And she realized, shit, she left money on the table. … And she said, “Per year.”

And the buyer wrote down, “Per year.”

And she went, oh, crap what else? She said, “There’s maintenance.”

He said, “How much?”

“25 percent per year.”

And he said, “That’s too much.”

She said, “15 percent.”

And he said, “OK.”

Sadly, not all transactions are as much fun as pricing enterprise software ;-) The price usually has to be determined without meeting the buyer face to face. There are three types of price discrimination based on how the price is determined:

  1. Each buyer is charged a custom price. (Traditionally, there has never been enough data to do this.)
  2. Price depends on an attribute of the buyer such as age or gender.
  3. Different price for different categories of buyers, with the seller somehow getting the buyer to reveal which category they fall into. As we’ll see, hilarity frequently ensues.

Additionally, each buyer may be sold the same product, or it could be customized to each segment—in the extreme case, to each buyer. This is called product differentiation.

Alright. Time to dive into some examples.

1. Student discounts at movies, museums, etc. are one of the simplest types of price discrimination. Students are generally poorer and more price sensitive, so the business hopes to attract more of them by making it cheaper.

Why museums and movies, and not say grocery stores? Two reasons: first, if the grocery store tried it, they’d quickly run into the problem of resale by the group that qualifies for the lower price. (It could manifest as parents sending their kids to get groceries.) The museum doesn’t have this problem because they ask for a student ID.

Second, grocery stores set prices pretty close to their marginal cost anyway, so there’s not as much scope for variable pricing. With museums, on the other hand, it costs them next to nothing to admit an extra visitor. All of their costs are fixed costs.

Prevention of resale and low marginal costs relative to fixed costs are two important ingredients for price discrimination.
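To see why both ingredients matter, here is a toy model with made-up numbers: 60 adults willing to pay $20 and 40 students willing to pay $8. The function names and every figure are hypothetical, chosen only for illustration. The model also assumes resale is impossible, which is what the student ID check achieves.

```python
def profit(price, buyers, marginal_cost):
    """Profit when everyone whose willingness to pay is >= price buys one unit."""
    units_sold = sum(1 for willingness in buyers if willingness >= price)
    return units_sold * (price - marginal_cost)


def best_profit(buyers, marginal_cost):
    """Best achievable profit with one price for this group (0 if not worth selling).

    The optimal price is always one of the buyers' willingness-to-pay values,
    so it is enough to try each of those.
    """
    candidates = sorted(set(buyers))
    return max([0] + [profit(p, buyers, marginal_cost) for p in candidates])


# Hypothetical market: willingness to pay, in dollars.
adults = [20] * 60
students = [8] * 40


def compare(marginal_cost):
    """Return (profit with one price for all, profit with a price per segment)."""
    uniform = best_profit(adults + students, marginal_cost)
    segmented = best_profit(adults, marginal_cost) + best_profit(students, marginal_cost)
    return uniform, segmented


museum = compare(marginal_cost=0)      # an extra visitor costs next to nothing
high_cost = compare(marginal_cost=9)   # each unit costs more than students will pay

print("Museum-like seller:", museum)
print("High-marginal-cost seller:", high_cost)
```

With zero marginal cost, the student price adds profit on top of the adult price. When each unit costs more than students are willing to pay, there is nothing to gain from a discount, so the segmented and uniform profits coincide.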

2. Ladies’ night at bars is another simple example of price discrimination based on an attribute (gender). Rather than women having a lower willingness to pay, it is perhaps more accurate to say that men are more desperate to get in :-)

Interestingly, this is one of the few examples whose legality is questionable. Wikipedia has a good survey. Also, it is not a “pure” example since the point of ladies’ night is not just to get more women through the door but also, indirectly, to get more men through the door.

3. A less obvious example is the variation of gas prices (and other commodities) within the same chain across locations. This is because people in richer ZIP codes are willing to pay more on average.

An important caveat: some of the variation is typically explainable by differences in marginal cost (such as rent) between different locations, but not all of it.

4. Financial aid at universities is a rather complex case of price discrimination. Instead of charging different rates to different students, the seller has a base rate and gives discounts (aid) to qualifying students.

Discounting is a frequently used form of “concealed” price discrimination.

You can see aid programs in humanitarian/political terms or in economic terms; the two paradigms are not in conflict with each other. In the economic view, students with higher scores receive aid because they have more college options and are therefore more price-sensitive. Poorer students and minorities receive aid because they are less able/willing to pay.

In the examples so far, the attribute(s) that factor into discrimination are either obvious (gender, race, location) or it is in the buyer’s interest to disclose them to the seller (student status, financial need). Now let’s look at examples where the seller has to be crafty in getting the buyer to disclose it.

5. Car prices vary greatly between market segments, far more than can be explained by differences in marginal cost. Car buyers segment themselves because owning a higher-end car is a status symbol.

Product differentiation is frequently used to get buyers to segment themselves.

The same principle applies to numerous other product categories like wine and coffee. But there, at least, you’re getting a nominally superior product for a higher price. Let’s look at examples where buyers voluntarily pay more for the same product.

6. Dell.com used to ask customers if they were home users, small businesses, or other categories. The prices for the same products varied according to the category you declared. There was no legally binding reason to be honest about your disclosure, and no enforcement mechanism.

Now for a more devious example.

7. “Staples brazenly sends out different office supply catalogs with different prices to the same customers. The price-sensitive buyers know which to buy from. The inattentive ones pay extra.” [source]

A similar example: restaurants with long menus sometimes highlight some popular choices on the first page. The same items are available in the long-form menu for cheaper, if only you knew where they’re buried.

These examples illustrate an extremely common form of price discrimination:

Buyers who are willing to jump through hoops demonstrate their high price-sensitivity and therefore get lower prices.

This theme is so fundamental that it has been practiced for thousands of years in the form of haggling.

8. The jumping-through-hoops principle suggests that it makes economic sense for the seller to make discounts hard to get. Nowhere is this more apparent than with Black Friday deals—stand in ridiculously long lines all night to get fabulous discounts. Wealthier customers who don’t bother doing so will get much less of a discount during regular store hours, even on Black Friday.

9. More examples of hard-to-get discounts: woot.com, mailing-list deals and Southwest Airlines DING. Many of these involve artificial scarcity and time-limitations to make them more difficult to get, thus ensuring that those who take advantage are buyers who might otherwise not buy at all.

10. Perhaps the most extreme example of roping in buyers who might otherwise not buy is deliberately crippling your own product, known in economics as damaged goods.

IBM did this with its popular LaserPrinter by adding chips that slowed down the printing to about half the speed of the regular printer. The slowed printer sold for about half the price, under the IBM LaserPrinter E name.

That example and more like it are from here. And a more poignant example from railways of long ago:

It is not because of the few thousand francs which would have to be spent to put a roof over the third-class carriages or to upholster the third-class seats that some company or other has open carriages with wooden benches. What the company is trying to do is to prevent the passengers who can pay the second class fare from traveling third class; it hits the poor, not because it wants to hurt them, but to frighten the rich. And it is again for the same reason that the companies, having proved almost cruel to the third-class passengers and mean to the second-class ones, become lavish in dealing with first-class passengers. Having refused the poor what is necessary, they give the rich what is superfluous.

These examples should make clear that:

Getting buyers to reveal their willingness to pay often has significant social costs.

11. There are endless examples of clever tricks to learn the customer’s price-sensitivity in the airline industry. The price for the same seat can vary greatly depending on a variety of factors. The most well-known one is that you get lower prices if your trip spans a weekend, because it probably means you’re not a business traveler.

12. First class and business class seating on airlines is also price discrimination, but of a very different kind. Here it’s not different prices for the same product but different prices for slightly different products. Buyers segment themselves due to product differentiation, a phenomenon we’ve seen before with cars.

The first class/economy price spread can often be as high as 10x, which illustrates the wide range of customers’ willingness to pay. For a variety of reasons, most other markets haven’t managed to attain such a high price spread.

The “holy grail” of price discrimination is to achieve dramatically higher price spreads in most markets.

Aaaaaand we’re done with the examples!

Note that this is far from a complete list—I haven’t covered clearance sales, loyalty programs and frequent flyer miles, hi-lo pricing, drug prices that vary by country, and so forth, but I hope I’ve convinced you that price discrimination in some form already happens in nearly every market.

But here’s the kicker: I’ve deliberately left out what I consider the most important class of examples, because I’m going to devote a whole article to it. I will argue that this emerging form of price discrimination is going to explode in popularity and dwarf anything we’ve seen so far. Feel free to guess what I’m thinking about in the comments, and stay tuned!

Many thanks to Justin Brickell, Alejandro Molnar and Adam Bossy for useful discussions and comments. Thanks also to my Twitter followers for putting up with my ‘tweetathon’ on this topic two months ago and providing feedback.

June 2, 2011 at 2:48 pm 6 comments

“You Might Also Like:” Privacy Risks of Collaborative Filtering

I have a new paper titled “You Might Also Like:” Privacy Risks of Collaborative Filtering with Joe Calandrino, Ann Kilzer, Ed Felten and Vitaly Shmatikov. We developed new “statistical inference” techniques and used them to show how the public outputs of online recommender systems, such as the “You Might Also Like” lists you see on many websites, can reveal individual purchases and preferences. Joe spoke about it at the IEEE S&P conference at Oakland earlier today.

Background: inference and statistical inference. The paper is about techniques for inference. At its core, inference is a simple concept: deducing that some event has occurred based on its effect on other observable events or objects, often seemingly unrelated. Think Sherlock Holmes, whether something simple such as the idea of a smoking gun, now so well known that it’s a cliché, or something more subtle like the curious incident of the dog in the night time.

Today, inference has evolved a great deal, and in our data-rich world, inference often means statistical inference. Detection of extrasolar planets is a good example of making deductions from the faintest clues: A planet orbiting a star makes the star wobble slightly, which affects the velocity of the star with respect to the Earth. And this relative velocity can be deduced from the displacement in the parent star’s spectral lines due to the Doppler effect, thus inferring the existence of a planet. Crazy!

Web privacy. But back to the paper: what we did was to develop and apply inference techniques in the web context, specifically recommender systems, in a way that no one had thought of before. As you may have noticed, just about every website publicly shows relationships between related items (products, videos, books, news articles, etc.), and these relationships are derived from purchases or views, which are private information. What if the public listings could be reverse engineered, so that we can infer a user’s purchases from them? As the abstract says:

Many commercial websites use recommender systems to help customers locate products and content. Modern recommenders are based on collaborative filtering: they use patterns learned from users’ behavior to make recommendations, usually in the form of related-items lists. The scale and complexity of these systems, along with the fact that their outputs reveal only relationships between items (as opposed to information about users), may suggest that they pose no meaningful privacy risk.

In this paper, we develop algorithms which take a moderate amount of auxiliary information about a customer and infer this customer’s transactions from temporal changes in the public outputs of a recommender system. Our inference attacks are passive and can be carried out by any Internet user. We evaluate their feasibility using public data from popular websites Hunch, Last.fm, LibraryThing, and Amazon.

The screenshot below shows an example of a related-items list on Amazon. There are up to 100 items in such lists.

Consider a user Alice who’s made numerous purchases, some of which she has reviewed publicly. Now she makes a new purchase which she considers sensitive. But this new item, because of her purchasing it, has a nonzero probability of entering the “related items” list of each of the items she has purchased in the past, including the ones she has reviewed publicly. And even if it is already in the related-items list of some of those items, it might improve its rank on those lists because of her purchase. By aggregating dozens or hundreds of these observations, the attacker has a chance of inferring that Alice purchased something, as well as the identity of the item she purchased.
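The idea in the paragraph above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm from the paper: the function name, the scoring rule (one point per list in which an item newly appears or moves up), the threshold, and the toy data are all assumptions made for this example.

```python
from collections import Counter


def infer_new_items(before, after, aux_items, threshold):
    """Guess which items the target recently acquired.

    before, after: dicts mapping an item to its public related-items list
        (best rank first), observed at two points in time.
    aux_items: items the attacker already knows the target has (for example,
        items the target reviewed publicly).
    threshold: minimum number of auxiliary lists in which a candidate must
        have newly appeared or risen in rank.
    """
    scores = Counter()
    for aux in aux_items:
        old_rank = {item: i for i, item in enumerate(before.get(aux, []))}
        for rank, item in enumerate(after.get(aux, [])):
            if item in aux_items:
                continue  # already known, nothing to infer
            if item not in old_rank or rank < old_rank[item]:
                scores[item] += 1  # new entry, or moved up the list
    return [(item, s) for item, s in scores.most_common() if s >= threshold]


# Toy data: three items Alice has reviewed publicly, observed twice.
aux = {"book_a", "book_b", "book_c"}
before = {
    "book_a": ["x", "y", "z"],
    "book_b": ["y", "w", "x"],
    "book_c": ["z", "x", "w"],
}
after = {
    "book_a": ["x", "secret", "y"],
    "book_b": ["y", "secret", "w"],
    "book_c": ["x", "secret", "z"],
}

strict = infer_new_items(before, after, aux, threshold=2)
loose = infer_new_items(before, after, aux, threshold=1)
print("strict:", strict)
print("loose: ", loose)
```

A single list change is weak evidence, since lists churn for many reasons. The same item moving in several of Alice's lists at once is much stronger, which is why the threshold matters: raising it yields fewer but more reliable guesses, and lowering it admits incidental movers like "x".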

It’s a subtle technique, and the paper has more details than you can shake a stick at if you want to know more.

We evaluated the attacks we developed against several websites of a diverse nature. Numerically, our best results are against Hunch, a recommendation and personalization website. There is a tradeoff between the number of inferences and their accuracy. When optimized for accuracy, our algorithm inferred a third of the test users’ secret answers to Hunch questions with no error. Conversely, if asked to predict the secret answer to every secret question, the algorithm had an accuracy of around 80%.

Impact. It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different. On sites other than Hunch (which had an API that provided exact numerical correlations between pairs of items), our attacks worked only on a small proportion of users, although that is sufficient to demonstrate the concept. (Hunch has since eliminated this feature of the API, for reasons unrelated to our research.) We also found that users of larger sites are much safer, because the statistical aggregates are computed from a larger set of users.

But here’s why we think this paper is important:

  • Our attack applies to a wide variety of sites—essentially every site with an online catalog of some sort. While we discuss various ways to mitigate the attack in the paper, there is no bulletproof “fix.”
  • It undermines the widely accepted dichotomy between “personally identifiable” individual records and “safe,” large-scale, aggregate statistics. Furthermore, it demonstrates that the dynamics of aggregate outputs (i.e., their variation with time) constitute a new vector for privacy breaches. Dynamic behavior of high-dimensional aggregates like item similarity lists falls beyond the protections offered by any existing privacy technology, including differential privacy.
  • It underscores the fact that modern systems have vast “surfaces” for attacks on privacy, making it difficult to protect fine-grained information about their users. Unintentional leaks of private information are akin to side-channel attacks: it is very hard to enumerate all aspects of the system’s publicly observable behavior which may reveal information about individual users.

That last point is especially interesting to me. We’re leaving digital breadcrumbs online all the time, whether we like it or not. And while algorithms to piece these trails together might seem sophisticated today, they will probably look mundane in a decade or two if history is any indication. The conversation around privacy has always centered around the assumption that we can build technological tools to give users—at least informed users—control over what they reveal about themselves, but our work suggests that there might be fundamental limits to those tools.

See also: Joe Calandrino’s post about this paper.

May 24, 2011 at 6:11 pm 4 comments

Insights on fighting “Protect IP” from a Q&A with Congresswoman Lofgren

Summary. Appeals to free speech and chilling effects are at best temporary measures in the fight against Protect IP and domain seizures. Even if we win this time it will keep coming back in modified form; the only way to defeat it for good is to convince Washington that artists are in fact thriving, that piracy is not the real problem, and that takedown efforts are not in the interest of society. We in the tech world know this, but we are doing a poor job of making ourselves heard in Washington, and this needs to change.

As most of you know, the Protect IP Act is a horrendous piece of proposed legislation sponsored by the “content industry” that gives branches of the Government powers to seize domain names at will, force websites to remove links, etc. Congresswoman Zoe Lofgren has been one of the very few legislators fighting the good fight, speaking out against this grave threat to free speech.

I was invited to a brown bag lunch with Rep. Lofgren at Mozilla today. (Mozilla has gotten involved in this because of the events surrounding the Mafiaafire add-on and Homeland Security.) I asked the Congresswoman this question (paraphrased):

“Does the strategy of domain-name seizures even have a prayer of achieving the intended outcome, or is it going to lead to something similar to the Streisand effect, as we’ve seen happen repeatedly on the Internet? Tools for circumvention of censorship in dictatorial regimes, that we can all get behind and that the U.S. government has often funded, may be morally different from tools for circumvention of anti-infringement efforts, but they are technologically identical.” [Princeton professor and now FTC chief technologist Ed Felten has pointed this out in a related context.]

In response, Rep. Lofgren pivoted to the point that seemed to be her favorite theme of the day—the tech world needs to come up with ways to monetize online content, she said. Unless that happens, it’s not looking good for our side in the long run.

At first I was slightly annoyed by her not addressing my question, but after she pivoted a couple more times to the same point in answer to other questions I started to pay close attention.

What the Congresswoman was saying was this:

  1. The only way to convince Washington to drop this issue for good is to show that artists and musicians can get paid on the Internet.
  2. Currently they are not seeing any evidence of this. The Congresswoman believes that new technology needs to be developed to let artists get paid. I believe she is entirely wrong about this; see below.
  3. The arguments that have been raised by tech companies and civil liberties groups in Washington all center around free speech; there is nothing wrong with that but it is not a viable strategy in the long run because the issue is going to keep coming back.

Let’s zoom in on point 2 above. We techies all say we have the answers. New technology is not needed, we say. The dinosaurs of the content industries need to adapt their business models. Piracy is not correlated with a decrease in sales. Piracy happens not because it is cheaper, but because it is more convenient. Businesses need to compete with piracy rather than trying to outlaw it. Artists who’ve understood this are already thriving.

Washington is willing to listen to this. But no one is telling it to them.

There are a million blog posts that make the points above. But those don’t have an impact in Congress. “You vote up articles on Reddit all day,” Rep. Lofgren said. “Guess what, we don’t check Reddit in Washington.” Yes, she actually said that. The exact wording might be off but she used words to essentially that effect. She also pointed out that the tech industry spends by far the least amount of effort on lobbying. The entire industry has fewer representatives, apparently, than individual companies from many other sectors do.

A lot of information that we consider common knowledge is not available in Washington. It needs to be in a digestible form; for example, academic studies with concrete numbers that can be cited will be particularly useful. But a simple and important first step is to start communicating with policymakers. In my dealings with them, I’ve found them more willing to listen than I would have thought. So here’s my plea to the community to redirect some of the energy that we expend writing blog posts and expressing outrage into something more constructive.

May 19, 2011 at 10:50 pm

A Map of 33bits.org

Moved here.

May 9, 2011 at 10:11 pm

The Master Switch and the Centralization of the Internet

One of the most important trends in the recent evolution of the Internet has been the move towards centralization and closed platforms. I’m interested in this question in the context of social networks—analyzing why no decentralized social network has yet taken off, whether one ever will, and whether a decentralized social network is important for society and freedom. With this in mind, I read Tim Wu’s ‘The Master Switch: The Rise and Fall of Information Empires,’ a powerful book that will influence policy debates for some time to come. My review follows.

‘The Master Switch’ has two parts. The first discusses the history of communications media through the twentieth century and shows evidence for “The Cycle” of open innovation → closed monopoly → disruption. The second, shorter part is more speculative and argues that the same fate will befall the Internet, absent aggressive intervention.

The first part of the book is unequivocally excellent. There are so many grand as well as little historical facts buried in there. Wu makes his case well for the claim that radio, telephony, film and television have all taken much the same path.

A point that Wu drives home repeatedly is that while free speech in law is always spoken of in the context of Governmental controls, the private entities that own or control the medium of speech play a far bigger role in practice in determining how much freedom of speech society has. In the U.S., we are used to regulating Governmental barriers to speech but not private ones, and a lot of the book is about exposing the problems with this approach.

An interesting angle the author takes is to look at the motives of the key men who shaped the “information industries” of the past. This is apposite given the enormous impact on history that each of these few has had, and I felt it added a layer of understanding compared to a purely factual account.

But let’s cut to the chase—the argument about the future of the Internet. I wasn’t sure whether I agreed or disagreed until I realized Wu is making two different claims, a weak one and a strong one, and does not separate them clearly.

The weak claim is simply that an open Internet is better for society in the long run than a closed one. Open and closed here are best understood via the exemplars of Google and Apple. Wu argues this reasonably well, and in any case not much argument is needed—most of us would consider it obvious on the face of it.

The strong claim, and the one that is used to justify intervention, is that a closed Internet will have such crippling effects on innovation and such chilling effects on free speech that it is our collective duty to learn from history and do something before the dystopian future materializes. This is where I think Wu’s argument falls short.

To begin with, Wu doesn’t have a clear reason why the Internet will follow the previous technologies, except, almost literally, “we can’t be sure it won’t.” He overstates the similarities and downplays the differences.

Second, I believe Wu doesn’t fully understand technology and the Internet in some key ways. Bizarrely, he appears to believe that the Internet’s predilection for decentralization is due to our cultural values rather than technological and business realities prevalent when these systems were designed.

Finally, Wu has a tendency to see things in black and white, in terms of good and evil, which I find annoying, and more importantly, oversimplified. He quotes this sentence approvingly: “Once we replace the personal computer with a closed-platform device such as the iPad, we replace freedom, choice and the free market with oppression, censorship and monopoly.” He also says that “no one denies that the future will be decided by one of two visions,” in the context of iOS and Android. It isn’t clear why he thinks they can’t coexist the way the Mac and PC have.

Regardless of whether one buys his dystopian prognostications, Wu’s paradigm of the “separations principle” is to be taken seriously. It is far broader than even net neutrality. There appear to be two key pillars: a separation of platforms and content, and limits on corporate structures to facilitate this (mainly vertical, but also horizontal, such as in the case of media conglomerates).

Interestingly, Wu wants the separations principle to be more of a societal-corporate norm than Governmental regulation. That said, he does call for more powers to the FCC, which is odd given that he is clear on the role that State actors have played in the past in enabling and condoning monopoly abuse:

Again and again in the histories I have recounted, the state has shown itself an inferior arbiter of what is good for the information industries. The federal government’s role in radio and television from the 1920s to the 1960s, for instance, was nothing short of a disgrace. In the service of chain broadcasting, it wrecked a vibrant, decentralized AM marketplace. At the behest of the ascendant radio industry, it blocked the arrival and prospects of FM radio, and then it put the brakes on television, reserving it for the NBC-CBS duopoly. Finally, from the 1950s through the 1960s, it did everything in its power to prevent cable television from challenging the primacy of the networks.

To his credit, Wu does seem to be aware of the contradiction, and appears to argue that the Government agencies can learn and change. It does seem like a stretch, however.

In summary, Wu deserves major kudos both for the historical treatment and for some very astute insights about the Internet. For example, in the last 2-3 years, Apple, Facebook, and Twitter have all made dramatic moves toward centralization, control and closed platforms. Wu seems to have foreseen this general trend more clearly than most techies did.[1] The book does have drawbacks, and I don’t agree that the Internet will go the way of past monopolies without intervention. It should be very interesting to see what moves Wu will make now that he will be advising the FTC.

[1] While the book was published in late 2010, I assume that Wu’s ideas are much older.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

March 23, 2011 at 7:51 pm

Privacy and the Market for Lemons, or How Websites Are Like Used Cars

I had a fun and engaging discussion on the “Paying With Data” panel at the South by Southwest conference; many thanks to my co-panelists Sara Marie Watson, Julia Angwin and Sam Yagan. I’d like to elaborate here on a concept that I briefly touched upon during the panel.

The market for lemons

In a groundbreaking paper 40 years ago, economist George Akerlof explained why so many used cars are lemons. The key is “asymmetric information:” the seller of a car knows more about its condition than the buyer does. This leads to “adverse selection” and a self-reinforcing downward spiral, with buyers tending to assume that there are hidden problems with cars on the market, which brings down prices and disincentivizes owners of good cars from trying to sell, further reinforcing the perception of bad quality.
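The spiral can be made concrete with a toy simulation. This is only a sketch under textbook-style assumptions that are not taken from this post: car qualities are spread evenly from 1 to 100, an owner sells only if the price is at least the car's quality, and buyers, who cannot inspect individual cars, pay 1.5 times the average quality of the cars on offer. The function name `market_rounds` and the parameter `buyer_premium` are made up for illustration.

```python
def market_rounds(qualities, buyer_premium=1.5, rounds=10):
    """Track the market price as owners of good cars withdraw, round by round."""
    price = max(qualities)          # start at a price where every owner would sell
    history = [price]
    for _ in range(rounds):
        # Only owners whose car is worth no more than the price stay in the market.
        offered = [q for q in qualities if q <= price]
        if not offered:
            break
        # Buyers cannot tell cars apart, so they pay for the average car on offer.
        price = buyer_premium * sum(offered) / len(offered)
        history.append(price)
    return history


qualities = list(range(1, 101))     # hypothetical quality scores, 1..100
history = market_rounds(qualities)
print(history)
```

Even though every buyer in this toy model values every car more than its owner does, the price falls in each round, because each price drop removes the best remaining cars and lowers the average that buyers are willing to pay for.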

In general, a market with asymmetric information is in danger of developing these characteristics: (1) buyers/consumers lack the ability to distinguish between high- and low-quality products; (2) sellers/service providers lose the incentive to focus on quality; and (3) the bad gradually crowds out the good, since poor-quality products are cheaper to produce.

Information security and privacy suffer from this problem at least as much as used cars do.

The market for security products and certification

Bruce Schneier describes how various security products, such as USB drives, have turned into a lemon market. And in a fascinating paper, Ben Edelman analyzes data from TRUSTe certifications and comes to some startling conclusions [emphasis mine]:

Widely-used online “trust” authorities issue certifications without substantial verification of recipients’ actual trustworthiness. This lax approach gives rise to adverse selection: The sites that seek and obtain trust certifications are actually less trustworthy than others. Using a new dataset on web site safety, I demonstrate that sites certified by the best-known authority, TRUSTe, are more than twice as likely to be untrustworthy as uncertified sites. This difference remains statistically and economically significant when restricted to “complex” commercial sites.

[...]
In a 2004 investigation after user complaints, TRUSTe gave Gratis Internet a clean bill of health. Yet subsequent New York Attorney General litigation uncovered Gratis’ exceptionally far-reaching privacy policy violations — selling 7.2 million users’ names, email addresses, street addresses, and phone numbers, despite a privacy policy exactly to the contrary.

[...]
TRUSTe’s “Watchdog Reports” also indicate a lack of focus on enforcement. TRUSTe’s postings reveal that users continue to submit hundreds of complaints each month. But of the 3,416 complaints received since January 2003, TRUSTe concluded that not a single one required any change to any member’s operations, privacy statement, or privacy practices, nor did any complaint require any revocation or on-site audit. Other aspects of TRUSTe’s watchdog system also indicate a lack of diligence.

The market for personal data

In the realm of online privacy and data collection, the information asymmetry results from a serious lack of transparency around privacy policies. The website or service provider knows what happens to data that’s collected, but the user generally doesn’t. This arises due to several economic, architectural, cognitive and regulatory limitations/flaws:

  • Each click is a transaction. As a user browses around the web, she interacts with dozens of websites and performs hundreds of actions per day. It is impossible to make privacy decisions with every click, or have a meaningful business relationship with each website, and hold them accountable for their data collection practices.
  • Technology is hard to understand. Companies can often get away with meaningless privacy guarantees such as “anonymization” as a magic bullet, or “military-grade security,” a nonsensical term. The complexity of private browsing mode has led to user confusion and a false sense of safety.
  • Privacy policies are filled with legalese and no one reads them, which means that disclosures made therein count for nothing. Yet, courts have upheld them as enforceable, disincentivizing websites from finding ways to communicate more clearly.

Collectively, these flaws have led to a well-documented market failure—there’s an arms race to use all means possible to entice users to give up more information, as well as to collect it passively through ever-more intrusive means. Self-regulatory organizations become captured by those they are supposed to regulate, and therefore their effectiveness quickly evaporates.

TRUSTe seems to be up to some shenanigans in the online tracking space as well. As many have pointed out, the TRUSTe “Tracking Protection List” for Internet Explorer is in fact a whitelist, allowing about 4,000 domains (almost certainly from companies that have paid TRUSTe) to track the user. Worse, installing the TRUSTe list seems to override the blocking of a domain via another list!

Possible solutions

The obvious response to a market with asymmetric information is to correct the information asymmetry—for used cars, it involves taking it to a mechanic, and for online privacy, it is consumer education. Indeed, the What They Know series has done just that, and has been a big reason why we’re having this conversation today.

However, I am skeptical that the market can be fixed through consumer awareness alone. Many of the factors I’ve laid out above involve fundamental cognitive limitations, and while consumers may be well-educated about the general dangers prevalent online, it does not necessarily help them make fine-grained decisions.

It is for these reasons that some sort of Government regulation of the online data-gathering ecosystem seems necessary. Regulatory capture is of course still a threat, but less so than with self-regulation. Jonathan Mayer and I point out in our FTC Comment that ad industry self-regulation of online tracking has been a failure, and argue that the FTC must step in and enforce Do Not Track.

In summary, information asymmetry occurs in many markets related to security and privacy, leading in most cases to a spiraling decline in quality of products and services from a consumer perspective. Before we can talk about solutions, we must clearly understand why the market won’t fix itself, and in this post I have shown why that’s the case.

Update. TRUSTe president Fran Maier responds in the comments.

Update 2. Chris Soghoian points me to this paper analyzing privacy economics as a lemon market, which seems highly relevant.

Thanks to Jonathan Mayer for helpful feedback.

March 18, 2011 at 4:37 pm 6 comments

Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge

The title of this post is also the title of a new paper of mine with Elaine Shi and Ben Rubinstein. You can grab a PDF or a web-friendly HTML version generated using my Project Luther software.

A brief de-anonymization history. As early as the first version of my Netflix de-anonymization paper with Vitaly Shmatikov back in 2006, a colleague suggested that de-anonymization can in fact be used to game machine-learning contests by simply “looking up” the attributes of de-anonymized users instead of predicting them. We off-handedly threw in a paragraph in our paper discussing this possibility, and a New Scientist writer seized on it as an angle for her article.[1] Nothing came of it, of course; we had no interest in gaming the Netflix Prize.

During the years 2007-2009, Shmatikov and I worked on de-anonymizing social networks. The paper that resulted (PDF, HTML) showed how to take two graphs representing social networks and map the nodes to each other based on the graph structure alone—no usernames, no nothing. As you might imagine, this was a phenomenally harder technical challenge than our Netflix work. (Backstrom, Dwork and Kleinberg had previously published a paper on social network de-anonymization; the crucial difference was that we showed how to put two social network graphs together rather than search for a small piece of graph-structured auxiliary information in a large graph.)

The context for these two papers is that data mining on social networks—whether online social networks, telephone call networks, or any type of network of links between individuals—can be very lucrative. Social networking websites would benefit from outsourcing “anonymized” graphs to advertisers and such; we showed that the privacy guarantees are questionable-to-nonexistent since the anonymization can be reversed. No major social network has gone down this path (as far as I know), quite possibly in part because of the two papers, although smaller players often fly under the radar.

The Kaggle contest. Kaggle is a platform for machine learning competitions. They ran the IJCNN social network challenge to promote research on link prediction. The contest dataset was created by crawling an online social network—which was later revealed to be Flickr—and partitioning the obtained edge set into a large training set and a smaller test set of edges augmented with an equal number of fake edges. The challenge was to predict which edges were real and which were fake. Node identities in the released data were obfuscated.

There are many, many anonymized databases out there; I come across a new one every other week. I pick a de-anonymization project if it will advance the art significantly (yes, de-anonymization is still partly an art), or if it is fun. The Kaggle contest was a bit of both, and so when my collaborators invited me to join them, it was too juicy to pass up.

The Kaggle contest is actually much more suitable to game through de-anonymization than the Netflix Prize would have been. As we explain in the paper:

One factor that greatly affects both [the privacy risk and the risk of gaming]—in opposite directions—is whether the underlying data is already publicly available. If it is, then there is likely no privacy risk; however, it furnishes a ready source of high-quality data to game the contest.

The first step was to do our own crawl of Flickr; this turned out to be relatively easy. The two graphs (the Kaggle graph and our Flickr crawl) were 95% similar, as we were later able to determine. The difference is primarily due to Flickr users adding and deleting contacts between Kaggle’s crawl and ours. Armed with the auxiliary data, we set about the task of matching up the two graphs based on the structure. To clarify: our goal was to map the nodes in the Kaggle training and test dataset to real Flickr nodes. That would allow us to simply look up the pairs of nodes in the test set in the Flickr graph to see whether or not the edge exists.

De-anonymization. Our effort validated the broad strategy in my paper with Shmatikov, which consists of two steps: “seed finding” and “propagation.” In the former step we somehow de-anonymize a small number of nodes; in the latter step we use these as “anchors” to propagate the de-anonymization to more and more nodes. In the propagation step the algorithm feeds on its own output.

Let me first describe propagation because it is simpler.[2] As the algorithm progresses, it maintains a (partial) mapping between the nodes in the true Flickr graph and the Kaggle graph. We iteratively try to extend the mapping as follows: pick an arbitrary as-yet-unmapped node in the Kaggle graph, find the “most similar” node in the Flickr graph, and if they are “sufficiently similar,” they get mapped to each other.

Similarity between a Kaggle node and a Flickr node is defined as cosine similarity between the already-mapped neighbors of the Kaggle node and the already-mapped neighbors of the Flickr node (nodes mapped to each other are treated as identical for the purpose of cosine comparison).

In the diagram, the blue nodes have already been mapped. The similarity between A and B is 2 / (√3·√3) = ⅔. Whether or not edges exist between A and A’ or B and B’ is irrelevant.
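To make the definition concrete, here is a minimal Python sketch of cosine similarity on neighbor sets. It assumes each node's already-mapped neighbors are given as a set of shared identifiers (mapped pairs collapsed to one label); the function name and the labels are hypothetical. The example reproduces the arithmetic above: three mapped neighbors on each side, two of them in common.

```python
import math


def cosine_similarity(neighbors_a, neighbors_b):
    """Cosine similarity between two sets of already-mapped neighbors."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a or not b:
        return 0.0  # no mapped neighbors means no evidence either way
    # For 0/1 membership vectors, the dot product is the size of the
    # intersection and each norm is the square root of the set size.
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical labels: A and B each have three mapped neighbors, two shared.
similarity = cosine_similarity({"x", "y", "z"}, {"y", "z", "w"})
print(similarity)  # 2 / (sqrt(3) * sqrt(3)), i.e. two thirds
```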

There are many heuristics that go into the “sufficiently similar” criterion, which are described in our paper. Due to the high percentage of common edges between the graphs, we were able to use a relatively pure form of the propagation algorithm; the one in my paper with Shmatikov, in contrast, was filled with lots more messy heuristics.
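The propagation loop described above might be sketched roughly as follows. This is an illustration, not the code used in the contest: it treats both graphs as undirected adjacency dictionaries, replaces the paper's heuristics for “sufficiently similar” with a single hypothetical `threshold`, and uses a tiny made-up pair of graphs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def propagate(kaggle, flickr, seeds, threshold=0.5):
    """Extend a seed mapping (Kaggle node -> Flickr node) until nothing changes.

    kaggle, flickr: dict mapping each node to the set of its neighbors.
    """
    mapping = dict(seeds)
    used = set(mapping.values())
    changed = True
    while changed:
        changed = False
        for k in kaggle:
            if k in mapping:
                continue
            # Express k's already-mapped neighbors as Flickr identities.
            k_mapped = {mapping[n] for n in kaggle[k] if n in mapping}
            best, best_score = None, 0.0
            for f in flickr:
                if f in used:
                    continue
                f_mapped = {n for n in flickr[f] if n in used}
                score = cosine(k_mapped, f_mapped)
                if score > best_score:
                    best, best_score = f, score
            # Accept the most similar candidate only if it is similar enough.
            if best is not None and best_score >= threshold:
                mapping[k] = best
                used.add(best)
                changed = True
    return mapping


# Two copies of the same small graph under different labels (made-up data).
flickr = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
          "d": {"b", "c", "e"}, "e": {"d"}}
kaggle = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5}, 5: {4}}
mapping_result = propagate(kaggle, flickr, seeds={1: "a", 2: "b"})
print(mapping_result)
```

Starting from two seeds, each newly mapped node becomes evidence for mapping its neighbors, which is the sense in which the algorithm feeds on its own output.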

Those elusive seeds. Seed identification was far more challenging. In the earlier paper, we didn’t do seed identification on real graphs; we only showed it possible under certain models for error in auxiliary information. We used a “pattern-search” technique; the Backstrom et al. paper uses a similar approach. It wasn’t clear whether this method would work, for reasons I won’t go into.

So we developed a new technique based on “combinatorial optimization.” At a high level, this means that instead of finding seeds one by one, we try to find them all at once! The first step is to find a set of k (we used k=20) nodes in the Kaggle graph and k nodes in our Flickr graph that are likely to correspond to each other (in some order); the next step is to find this correspondence.

The latter step is the hard one, and basically involves solving an NP-hard problem of finding a permutation that minimizes a certain weighting function. During the contest I basically stared at this page of numbers for a couple of hours, and then wrote down the mapping, which to my great relief turned out to be correct! But later we were able to show how to solve it in an automated and scalable fashion using simulated annealing, a well-known technique to approximately solve NP-hard problems for small enough problem sizes. This method is one of the main research contributions in our paper.
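For readers unfamiliar with simulated annealing, here is a generic sketch of how it can search for a permutation. The weighting function here is an assumption made for illustration, not the one from the paper: given two k×k matrices of pairwise scores `A` (candidate nodes in one graph) and `B` (candidate nodes in the other), the cost of a permutation is the total mismatch between corresponding entries. The names `cost`, `anneal`, and all parameter values are hypothetical.

```python
import math
import random


def cost(perm, A, B):
    """Total mismatch when node i of A is matched to node perm[i] of B."""
    k = len(perm)
    return sum(abs(A[i][j] - B[perm[i]][perm[j]])
               for i in range(k) for j in range(i + 1, k))


def anneal(A, B, steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Search for a low-cost permutation by simulated annealing."""
    rng = random.Random(seed)
    k = len(A)
    perm = list(range(k))
    rng.shuffle(perm)
    current = cost(perm, A, B)
    best, best_cost = perm[:], current
    for step in range(steps):
        # Geometric cooling from t_start down toward t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(k), 2)
        perm[i], perm[j] = perm[j], perm[i]          # propose a swap
        candidate = cost(perm, A, B)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if candidate <= current or rng.random() < math.exp((current - candidate) / t):
            current = candidate
            if current < best_cost:
                best, best_cost = perm[:], current
        else:
            perm[i], perm[j] = perm[j], perm[i]      # undo the swap
    return best, best_cost


# Made-up test instance: A is B with its rows and columns secretly permuted.
rng = random.Random(1)
k = 6
B = [[0.0] * k for _ in range(k)]
for i in range(k):
    for j in range(i + 1, k):
        B[i][j] = B[j][i] = rng.random()
true_perm = [2, 0, 3, 1, 5, 4]
A = [[B[true_perm[i]][true_perm[j]] for j in range(k)] for i in range(k)]

best, best_cost = anneal(A, B)
print(best, best_cost)
```

Because the search is randomized, it offers no guarantee of finding the optimal permutation; in practice one would rerun it with different seeds and cooling schedules and keep the lowest-cost result.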

After carrying out seed identification, and then propagation, we had de-anonymized about 65% of the edges in the contest test set and the accuracy was about 95%. The main reason we didn’t succeed on the other third of the edges was that one or both of the nodes had a very small number of contacts/friends, resulting in too little information to de-anonymize. Our task was far from over: combining de-anonymization with regular link prediction also involved nontrivial research insights, for which I will again refer you to the relevant section of the paper.

Lessons. The main question that our work raises is where this leaves us with respect to future machine-learning contests. One necessary step that would help a lot is to amend contest rules to prohibit de-anonymization and to require source code submission for human verification, but as we explain in the paper:

The loophole in this approach is the possibility of overfitting. While source-code verification would undoubtedly catch a contestant who achieved their results using de-anonymization alone, the more realistic threat is that of de-anonymization being used to bridge a small gap. In this scenario, a machine learning algorithm would be trained on the test set, the correct results having been obtained via de-anonymization. Since successful [machine learning] solutions are composites of numerous algorithms, and consequently have a huge number of parameters, it should be possible to conceal a significant amount of overfitting in this manner.

As with the privacy question, there are no easy answers. It has been over a decade since Latanya Sweeney’s work provided the first dramatic demonstration of the privacy problems with data anonymization; we still aren’t close to fixing things. I foresee a rocky road ahead for machine-learning contests as well. I expect I will have more to say about this topic on this blog; stay tuned.

[1] Amusingly, it was a whole year after that before anyone paid any attention to the privacy claims in that paper.

[2] The description is from my post on the Kaggle forum which also contains a few additional details.

March 9, 2011 at 12:30 pm 4 comments

In Which I Interrupt Your Regularly Scheduled Programming to Talk about Immigration Policy

On a recent trip to India for the winter break, I needed to renew my US visa. Like many people working on computer security and other subjects on the “Technology Alert List,” I ended up getting stuck there while my application was sent back to the US Department of State, where they supposedly make sure I’m not conducting espionage. I was lucky: I was “only” delayed by a little over a month. (I’m told that the wait used to be several months, and applicants would often give up.) Nonetheless, it was hugely disruptive: I missed a conference where I was supposed to speak, multiple panels and innumerable meetings.

There are several absurd aspects to the way the State Department and the Consulate process these applications:

  • Processing takes a highly variable amount of time. If it always took a month it wouldn’t be nearly as bad, but since it sometimes takes several months, it wrecks your ability to schedule things.
  • The consulate is highly understaffed. A decision to reject an applicant or stick them in limbo is made based on a 1-2 minute interview.
  • I’ve already been in the country for 6.5 years. Besides, my leaving the country was entirely voluntary, and I’m not required to renew my visa unless I do choose to leave voluntarily. One would think that if I were up to something I would have done it by now, or at least not have left.
  • There is no way to get this time-consuming background check done while I’m still in the country.
  • All of this would be justifiable in some way if the system at least worked. But the determination of whether an applicant is working on something sensitive is entirely dependent on what they put on their application; worse, it’s based on keyword matching. It is often possible to reword your application to avoid these keywords if you know how; I wasn’t smart enough to do so.

Immigrants are not the only ones harmed by the muddleheaded visa policy and the fickle behavior of the visa overlords; all Americans are. The H-1B lottery, processing delays and other visa problems contribute to turning skilled workers and scientists back home, which hurts the economy. In fact the US spends taxpayer money to educate Ph.D.s and then encourages or forces them to leave.

As with many problems of Government, a major factor here seems to be that there is a vast and bloated immigration apparatus mired in rules and with no central oversight. Are there things an ordinary person can do to help improve the situation? I’d welcome any thoughts on the issue.

February 23, 2011 at 4:31 pm 6 comments

One Click Frauds and Identity Leakage: Two Trends on a Collision Course

One of my favorite computer security papers of 2010 is by Nicolas Christin, Sally Yanagihara and Keisuke Kamataki on “one click frauds,” a simple yet shockingly effective form of social engineering endemic to Japan. I will let the authors explain:

In the family apartment in Tokyo, Ken is sitting at his computer, casually browsing the free section of a mildly erotic website. Suddenly, a window pops up, telling him,

Thank you for your patronage! You successfully registered for our premium online services, at an incredible price of 50,000 JPY. Please promptly send your payment by bank transfer to ABC Ltd at Ginko Bank, Account 1234567. Questions? Please contact us at 080-1234-1234.

Your IP address is 10.1.2.3, you run Firefox 3.5 over Windows XP, and you are connecting from Tokyo.

Failure to send your payment promptly will force us to mail you a postcard reminder to your home address. Customers refusing to pay will be prosecuted to the fullest extent of the law. Once again, thank you for your patronage!

A sample postcard reminder is shown on the screen, and consists of a scantily clad woman in a provocative pose. Ken has a sudden panic attack: He is married, and, if his wife were to find out about his browsing habits, his marriage would be in trouble, possibly ending in divorce, and public shame. In his frenzied state of mind, Ken also fears that, if anybody at his company heard about this, he could possibly lose his job. Obviously, those website operators know who he is and where he lives, and could make his life very difficult. Now, 50,000 JPY (USD 500) seems like a small price to pay to make all of this go away. Ken immediately jots down the contact information, goes to the nearest bank, and acquits himself of his supposed debt.

Ken has just been the victim of a relatively common online scam perpetrated in Japan, called “One Click Fraud.” In this fraud, the “customer,” i.e., the victim, does not enter any legally binding agreement, and the perpetrators only have marginal information about the client that connected to their website (IP address, User-Agent string), which does not reveal much about the user. However, facing a display of authority stressed by the language used, including the notion that they are monitored, and a sense of shame from browsing sites with questionable contents, most victims do not realize they are part of an extortion scam. Some victims even call up the phone numbers provided, and, in hopes of resolving the situation, disclose private information, such as name or address, to their tormentors, which makes them even more vulnerable to blackmail.

As a result, One Click Frauds have been very successful in Japan. Annual police reports show that the estimated amount of monetary damages stemming from One Click Frauds and related confidence scams are roughly 26 billion JPY per year (i.e., USD 260 million/year). [emphasis mine]

The authors offer a fascinating economic analysis based on a near-exhaustive collection of fraud reports over a several-year period. Each scam offers 3 types of data points: the domain name where the scam appeared, the phone number the victim is asked to call, and the bank account number where the money is asked to be deposited. They plot the graph of all links between the ~500 domains, ~700 bank accounts and ~200 phone numbers, and report, among other nifty findings, that at most 13 groups are responsible for over half of all one-click frauds. Based on simple cost estimates, they also find that for each scam operated, the scammers recover their costs (bank account fee, bandwidth, etc.) with as few as 4 victims per year.
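One plausible way to turn such a link graph into “groups” is to compute connected components: any two scams that share a domain, phone number, or bank account end up in the same group. The sketch below does this with a small union-find structure. It is an assumption about the general approach, not a reproduction of the authors' analysis, and the reports are invented placeholders.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x's group, shortening paths along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_scams(reports):
    """Group identifiers that are linked, directly or indirectly, by a report.

    reports: iterable of (domain, phone, bank_account) tuples.
    """
    parent = {}
    for report in reports:
        for item in report:
            parent.setdefault(item, item)
        # Everything mentioned in one report belongs to the same operation.
        root = find(parent, report[0])
        for item in report[1:]:
            parent[find(parent, item)] = root
    groups = defaultdict(set)
    for item in parent:
        groups[find(parent, item)].add(item)
    return list(groups.values())


# Invented reports: (domain, phone number, bank account).
reports = [
    ("site-a.example", "phone-1", "acct-1"),
    ("site-b.example", "phone-1", "acct-2"),   # shares a phone with site-a
    ("site-c.example", "phone-2", "acct-3"),
    ("site-d.example", "phone-3", "acct-3"),   # shares an account with site-c
    ("site-e.example", "phone-4", "acct-4"),
]
groups = group_scams(reports)
print(len(groups), sorted(len(g) for g in groups))
```

Five seemingly separate scam sites collapse into three groups here, which is the kind of consolidation that lets a handful of operators account for a large share of reports.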

In this post I want to talk about the possible evolution of one-click frauds. At some point, either due to public awareness campaigns or due to saturation, the Japanese public will catch on to the fact that the attempted blackmail is fake and that the websites don’t actually have their identity. When this happens the scammers will be forced to up their game. Another impetus for increasing sophistication is making the fraud work outside Japan; the current version probably won’t work there, since the instinctive obedience to apparent authority seems characteristically Japanese.

And by ‘up their game,’ I mean that the scammers will probably get wise to the fact that they can discover the victim’s actual identity, and establish a credible threat instead of a fake one.

Readers of this blog know that I have announced or reported numerous attacks/vulnerabilities under the “ubercookies” series (1, 2, 3, 4, and part of 5) that allow a website to uncover a visitor’s identity, i.e., a Google/Facebook/Twitter handle. At the same time, connecting an online profile or email address to real-world information is becoming increasingly easy to automate. Putting two and two together, it is clear why one-click frauds could get very serious any day.

What might stop this logical progression of one-click frauds? Perhaps all identity-leak vulnerabilities will be found and fixed, but that’s a rather naïve hope, as the history of malware shows. Or maybe the public will eventually learn to resist the scam even in the face of a credible threat. That will take a long time, however, and a lot of damage will be done by then. Perhaps the technical skills required will remain beyond the reach of the scammers. But experience suggests that with a sufficiently lucrative prize, technical sophistication is no barrier—all it takes is one or two actual hackers; script-kiddie scammers can take care of the rest.

The best hope, as with any scam, is law enforcement. The authors list several factors, many specific to Japan, why the prosecution probability for one-click frauds is currently low. In addition, penalties for those who do get caught are also low: “One Click Frauds very often do not meet the legal tests necessary for qualifying as “fraud,” as in the vast majority of cases, the victim pays up immediately, and there is no active blackmailing effort from the miscreant.” A version of the scam that involved identity stealing would likely fall under the US Computer Fraud and Abuse Act or an equivalent, and would thus be more clearly illegal. Will this make a difference? Let’s wait and see.

February 21, 2011 at 5:30 pm 2 comments


About 33bits.org

This is a blog about my research on breaking data anonymization, and more broadly about digital privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
