I used to suck at giving technical talks. I would usually confuse my audience, and often confuse myself. By the time I became a prof, I sucked a lot less. These days I enjoy giving technical talks and lectures more than non-technical ones, and my students seem to like them better as well.
So something had changed; I’d developed a process. The other day I sat down to see if I could extract what this process was. It turned out to be surprisingly formulaic, like an algorithm, so I’d like to share it with you. I’m guessing this is obvious to most professors who teach technical topics, but I hope it will be helpful to those who’re relatively new to the game.
There are three steps. They’re simple but not easy.
- Identify the atomic concepts
- Draw the dependency graph
- Find a topological ordering of the graph
Identify atomic concepts. The key word here is atomic. The idea is to introduce only one key concept at one time and give the audience time to internalize the concept before moving on to the next one.
This is hard for two reasons. First, concepts that seem atomic to an expert are often an amalgam of different concepts. Second, it’s audience-specific. You have to have a good mental model of which concepts are already familiar to your audience.
Draw the dependency graph. Occasionally I use a whiteboard for this, but usually it’s in my head. This is a tricky step because it’s easy to miss dependencies. When the topic I’m teaching is the design of a technical system, I ask myself questions like, “what could go wrong in this component?” and “why wasn’t this alternative design used?” This helps me flesh out the internal logic of the system in the form of a graph.
Find a topological ordering. This is just a fancy way of saying we want to order the concepts so that each concept only depends on the ones already introduced. Sometimes this is straightforward, but sometimes the dependency graph has cycles!
Of the topics I’ve taught recently, Bitcoin seems especially difficult in this regard. Each concept is bootstrapped off of the others, but somehow the system magically works when you put everything together. What I do in these cases is introduce intermediate steps that don’t exist in the actual design I’m teaching, and remove them later .
Think of a technical topic as a skyscraper. When it’s presented in a paper, it’s analogous to unveiling a finished building. The audience can admire it and check that it’s stable/correct (say, by verifying theorems or other technical arguments.) But just as staring at a building doesn’t help you learn how to build one, the presentation in a typical paper is all but useless for pedagogical purposes. Having dependencies between concepts is perfectly acceptable in papers, because papers are not meant to be read in a single pass.
The instructor’s role, then, is to reverse engineer how the final concept might plausibly be built up step by step. This is analogous to showing the scaffolding of the building and explaining each step in its construction. Talks and lectures, unlike papers, must necessarily have this linear form because the audience can’t keep state in their heads.
 This process introduces new nodes in the dependency graph and removes some edges so that it is no longer cyclic.
At Princeton I get to advise many gifted graduate and undergraduate students in doing research. Combining my experience as a mentor with reflecting on my experience as a student, I’d like to offer some guidance on how to pick your first research project.
I’m writing this post because selecting a research problem to work on is significantly harder than actually solving it. I mean the previous sentence quite literally and without exaggeration. As an undergraduate and early-stage graduate researcher, I repeatedly spent months at a time working on research problems only to have to abandon my efforts because I found out I was barking up the wrong tree. Scientific research, it turns out, is largely about learning to ask the right questions.
The good news is that three simple criteria will help you avoid most of the common pitfalls.
1. Novelty. Original research is supposed to be, well, original. There are two components to novelty. The first is to make sure the problem you’re trying to solve hasn’t already been solved. This is way trickier than it seems — you might miss previous research because you’re using different names for concepts compared to the standard terminology. But the issue is deeper: two ideas may be equivalent without sounding at all the same at a superficial level. Your advisor’s help will be crucial here.
The other aspect to novelty is that you should have a convincing answer to the question “why has this problem not been solved yet?” Often this might involve a dataset that only recently became available, or some clever insight you’ve come up with that you suspect others missed. In practice, one often has an insight and then looks for a problem to apply it to. This means you have to put in a good bit of creative thinking even to pick a research question, and you must be able to estimate the difficulty level of solving it.
If your answer to the question is, “because the others who tried it weren’t smart enough,” you should probably think twice. It may not be prudent to have the success of your first project ride on your intellectual abilities being truly superlative.
2. Relevance. You must try to ensure that you select a problem that matters, one whose solution will impact the world directly or indirectly (and hopefully for the better). Again, your advisor’s help will be essential. (That said, professional researchers do produce massive volumes of research papers that no one cares about.) I encourage my students to pick subproblems of my ongoing long-term research projects. This is a safe way to pick a problem that’s relevant.
3. Measurable results. This one becomes automatic as you get experienced, but for beginning researchers it can be confusing. The output of your research should be measurable and reproducible; ideally you should be able to formulate your goals as a testable hypothesis. Measurability means that many interesting projects that are novel and make the world better are nevertheless unsuitable for research. (They may be ideal for a startup or a hobby project instead.) “Build a website for illiterate kids in poor countries to learn effectively” is an example of a task that’s hard to frame as a research question.
Irrelevant criteria. Let me also point out what’s not on this list. First, the general life advice you often hear, to do something you’re passionate about, is unfortunately a terrible way to pick a research problem. If you start from something you’re passionate about, the chance that it will meet the three criteria above is pretty slim. Often one has to consider a dozen or more research ideas before settling on one to work on.
You should definitely pick a research area you’re passionate about. But getting emotionally invested in a specific idea or research problem before you’ve done the due diligence is a classic mistake, one that I made a lot as a student.
Second, the scope or importance of the problem is another criterion you shouldn’t fret much about for your first project. Your goal is as much to learn the process of research as to produce results. You probably have a limited amount of time in which you want to evaluate if this whole research thing is the right fit for you. While you should definitely pick a useful and relevant research task, it should be something that you have a reasonable chance of carrying to fruition. Don’t worry about curing cancer just yet.
Note that the last point is at odds with advice given to more experienced researchers. Richard Hamming, in a famous talk titled “You and your research,” advised researchers to pick the most important problem that they have a shot at solving. I’ve written a version of the current post for those who’re in it for the long haul, and my advice there is to embrace risk and go for the big hits.
Aaron mentioned he was looking for one more speaker for this panel, so that we could hear the view of someone naive and inexperienced, and asked if I was available. I said, “Great, I do that every day!” So that will be the tone of my comments today. I don’t have any concrete proposals that can be implemented next year or in two years. Instead these are blue-sky thoughts on how things could work someday and hopeful suggestions for moving in that direction. 
I just finished my first year as a faculty member at Princeton. It’s still a bit surreal. I wasn’t expecting to have an academic career. In fact, back in grad school, especially the latter half, whenever someone asked me what I wanted to do after I graduated, my answer always was, “I don’t know for sure yet, but there’s one career I’m sure I don’t want — academia.”
I won’t go into the story of why that was and how it changed. But it led to some unusual behavior. I ranted a lot about academia on Twitter, as Aaron already mentioned when he introduced me. Also, many times I “published” stuff by putting up a blog post. For instance I had a series of posts on the ability of a malicious website to deanonymize visitors (1, 2, 3, 4, 5, 6). People encouraged me to turn it into a paper, and I could have done that without much extra effort. But I refused, because my primary goal was to quickly disseminate the information, and I felt my blog posts had accomplished that adequately. True, I wouldn’t get academic karma, but why would I care? I wasn’t going to be an academic!
When I eventually decided I wanted to apply for academic positions, I talked to a professor whose opinion I greatly respected. He expressed skepticism that I’d get any interviews, given that I’d been blogging instead of writing papers. I remember thinking, “oh shit, I’ve screwed up my career, haven’t I?” So I feel extremely lucky that my job search turned out successfully.
At this point a sane person would have decided to quit while they were ahead, and start playing the academic game. But I guess sanity has never really been one of my strong points. So in the last year I’ve been thinking a lot about what the process of research collaboration and publishing would look like if we somehow magically didn’t have to worry at all about furthering our individual reputations.
Something that’s very close to my ideal model of collaboration is the Polymath project. I was fascinated when I heard about it a few years ago. It was started by mathematician Tim Gowers in a blog post titled “Is massively collaborative mathematics possible?”  He and Terry Tao are the leaders of the project. They’re among the world’s top mathematicians. There have been several of these collaborations so far and they’ve been quite successful, solving previously open math problems. So I’ve been telling computer scientists about these efforts and asking if our community could produce something like this. 
To me there are three salient aspects of Polymath. The first is that the collaboration happens online, in blog posts and comments, rather than phone or physical meetings. When I tell people this they are usually enthusiastic and willing to try something like that. The second aspect is that it is open, in that there is no vetting of participants. Now people are a bit unsure, and say, “hmm, what’s the third?” Well, the third aspect is that there’s no keeping score of who contributed what. To which they react, “whoa, whoa, wait, what??!!”
I’m sure we can all see the problem here. Gowers and Tao are famous and don’t have to worry about furthering their careers. The other participants who contribute ideas seem to do it partly altruistically and partly because of the novelty of it. But it’s hard to imagine this process being feasible on a bigger scale.
Let’s take a step back and ask why there’s this gap between doing good research and getting credit for it. In almost every industry, every human endeavor, we’ve tried to set things up so that the incentives for individuals and the broader societal goals of the activity align with each other. But sometimes individual incentives get misaligned with the societal goals, and that leads to problems.
Let’s look at a few examples. Individual traders play the stock market with the hope of getting rich. But at the same time, it helps companies hedge against risk and improves overall financial stability. At least that’s the theory. We’ve seen it go wrong. Similarly, copyright is supposed to align the desire of creators to make money with the goal of the maximum number of people enjoying the maximum number of creative works. That’s gotten out of whack because of digital technology.
My claim is that we’re seeing the same problem in academic research. There’s a metaphor that explains what’s going on in research really well, and to me it is the root of all of the ills that I want to talk about. And that metaphor is publishing as competition. What do I mean by that? Well, peer review is a contest. Succeeding at this contest is the immediate incentive that we as researchers have. And we hope that this will somehow lead to science that benefits humanity.
To be clear, I’m far from the first one to make this observation. Let me quote someone who’s much better qualified to talk about this. Oded Goldreich, I’m sure most of you know of him, has a paper titled “On Struggle and Competition in Scientiﬁc Fields.” Here’s my favorite quote from the paper. He’s talking about the flagship theory conferences.
Eventually, FOCSTOC may become a pure competition, deﬁned as a competition having no aim but its own existence (i.e., the existence of a competition). That is, pure competitions serve no scientiﬁc purpose. Did FOCSTOC reach this point or is close to it? Let me leave this question open, and note that my impression is that things are deﬁnitely evolving towards this direction. In any case, I think we should all be worried about the potential of such an evolution.
I’m don’t know enough about the theory community to have an opinion on how big a problem this is. Still, I’m sure we can agree with the sentiment of the last sentence.
But here’s the very next paragraph. I think it gives us hope.
Other TOC conferences seem to suffer less from the aforementioned phenomena. This is mainly because they “count” less as evidence of importance (i.e., publications in them are either not counted by other competitions or their eﬀect on these competitions is less signiﬁcant). Thus, the vicious cycle described above is less powerful, and consequently these conferences may still serve the intended scientiﬁc purposes.
We see the same thing in the security and privacy community. Something I’ve seen commonly is a situation where you have a neat result, but nothing earth-shattering, and it’s not good enough as it is for a top tier venue. So what do you do? You pad it with bullshit and submit it, and it gets in. Another trend that this encourages is deliberately making a bad or inaccurate model so that you can solve a harder problem. But PETS publications and participants seem to suffer less from these effects. That’s why I’m happy to be discussing this issue with this group of people.
Paper as final output
It seems like we’re at an impasse. We can agree that publishing-as-competition has all these problems, but hiring committees and tenure committees need competitions to identify good research and good researchers. But I claim that publishing as competition fails even at the supposed goal of identifying useful research.
The reason for that is simple. Publishing as competition encourages or even forces viewing the paper as the final output. But it’s not! The hard work begins, not ends when the paper is published. This is unlike the math and theory communities, where the paper is in fact the final output. If publishing-as-competition is so bad for theory, it’s much worse for us.
In security and privacy research, the paper is the starting point. Our goal is not to prove theorems but to more directly impact the world in some way. By creating privacy technologies, for example. For research to have impact, authors have to do a variety of things after publication depending on the nature of the research. Build technology and get people to adopt it. Explain the work to policymakers or to other researchers who are building upon it. Or even just evangelize your ideas. Some people claim that ideas should stand on their own merit and compete with other ideas on a level playing field. I find this quite silly. I lean toward the view expressed in this famous quote you’ve probably heard: “if your ideas are any good you’ll have to shove them down people’s throats.”
The upshot of this is that impact is heavily shortchanged in the publication-as-competition model. This is partly because of what I’ve talked about, we have no incentive to do any more work after getting the paper published. But an equally important reason is that the community can’t judge the impact of research at the point of publication. Deciding who “wins the prizes” at the point of publication, before the ideas have a chance to prove themselves, has disastrous consequences.
So I hope I’ve convinced you that publication-as-competition is at the root of many of our problems. Let me give one more example. Many of us like the publish-then-filter model, where reviews are done in the open on publicly posted papers with anyone being able to comment. One major roadblock to moving to this model is that it screws up the competition aspect. The worry is that papers that receive a lot of popular attention will be reviewed favorably, and so forth. We want papers to be reviewed on a level playing field. But if the worth of a paper can’t be judged at publication time, that means all this fairness is toward an outcome that is meaningless anyway. Do we still want to keep this model at all costs?
A way forward?
So far I’ve done a lot of complaining. Let me offer some suggestions now. I want to give two sets of suggestions that are complementary. The first is targeted at committees, whether tenure committees, hiring committees, award communities, or even program committees to an extent, and to the community in general. The second is targeted at authors.
Here’s my suggestion for committees and the community: we can and should develop ways to incentivize and measure real impact. Let me give you a four examples. I have more that I’d be happy to discuss later. First, retrospective awards. That is, “best paper from this conference 10 years ago” or some such. I’ve been hearing more about these of late, and I think that’s good news. The idea is that impact is easier to evaluate 10 years after publication.
Second, overlay journals. These are online journals that are a way of “blessing” papers that have already been published or made public. There is a lag between initial publication and inclusion in the overlay journal, and that’s a good thing. Recently the math community has come up with a technical infrastructure for running overlay journals. I’m very excited about this. 
There are two more that are related. These are specific to our research field. For papers that are about a new tool, I think we should look at adoption numbers as an important component of the review process. Finally, such papers should also have an “incentives” section or subsection. Because all too often we write papers that we imagine unspecified parties will implement and deploy, but it turns out there isn’t the slightest economic incentive for any company or organization to do so.
I think we should also find ways to measure contributions through blog posts and sharing data and code in publications. This seems more tricky. I’d be happy to hear suggestions on how to do it.
Next, this is what I want to say to authors: the supposed lack of incentives for nontraditional ways of publishing is greatly exaggerated. I say this from my personal experience. I said earlier that I was very lucky that my job search turned out well. That’s true, but it wasn’t all luck. I found out to my surprise that my increased visibility through blogging and especially the policy work that came out of it made a huge difference to my prospects. If I’d had three times as many publications and no blog, I probably would have had about the same chances. I’m sure some departments didn’t like my style, but there are definitely others that truly value it.
My Bitcoin experiment
I have one other personal experience to share with you. This is an experiment I’ve been doing over the last month or so. I’d been thinking about the possibility of designing a prediction market on top of Bitcoin that doesn’t have a central point of control. Some of you may know the sad story of Intrade. So I tweeted my interest in this problem, and asked if others had put thought into it. Several people responded. I started an email thread for this group, and we went to work.
12,000 words and several conference calls later, we’re very happy with where we are, and we’ve started writing a paper presenting our design. What’s even better is who the participants are — Jeremy Clark at Carleton, Joe Bonneau who did his Ph.D. with Ross Anderson and is currently at Google, and Andrew Miller at UMD who is Jon Katz’s Ph.D. student. All these people are better qualified to write this paper than I am. By being proactive and reaching out online, I was able to assemble and work with this amazing team. 
But this experiment didn’t go all the way. While I used Twitter to find the participants and was open to accepting anyone, the actual collaboration is being done through traditional channels. My original intent was to do it in public, but I realized quite early on that we had something publication-worthy and became risk-averse.
I plan to do another experiment, this time with the explicit goal of doing it in public. This is again a Bitcoin-related paper that I want to write. Oddly enough, there is no proper tutorial of Bitcoin, nor is there a survey of the current state of research. I think combining these would make a great paper. The nature of the project makes it ideal to do online. I haven’t figured out the details yet, but I’m going to launch it on my blog and see how it goes. You’re all welcome to join me in this experiment. 
So that’s basically what I wanted to share with you today. I think the current model of publication as competition has gone too far, and the consequences are starting to get ruinous. It’s time we put a stop to it. I believe that committees on one hand, and authors on the other both have the incentive to start changing things unilaterally. But if the two are combined, the results can be especially powerful. In fact, I hope that it can lead to a virtuous cycle. Thank you.
 Aaron didn’t actually say that, of course. You probably got that. But who knows if nuances come across in transcripts.
 At this point I polled the room to see who’d heard of Polymath before. Only three hands went up (!)
 There is one example that’s closer to computer science that I’m aware of: this book on homotopy type theory written in a similar spirit as the Polymath project.
 Something that I meant to mention at the end but ran out of time for is Michael Neilsen’s excellent book Reinventing Discovery: The New Era of Networked Science. If you find the topic of this post at all interesting, you should absolutely read this book.
Given the pervasive tracking and profiling of our shopping and browsing habits, one would expect that retailers would be very good at individualized price discrimination — figuring out what you or I would be willing to pay for an item using data mining, and tailoring prices accordingly. But this doesn’t seem to be happening. Why not?
This mystery isn’t new. Mathematician Andrew Odlyzko predicted a decade ago that data-driven price discrimination would become much more common and effective (paper, interview). Back then, he was far ahead of his time. But today, behavioral advertising at least has gotten good enough that it’s often creepy. The technology works; the impediment to price discrimination lies elsewhere. 
It looks like consumers’ perception of unfairness of price discrimination is surprisingly strong, which is why firms balk at overt price discrimination, even though covert price discrimination is all too common. But the covert form of price discrimination is not only less efficient, it also (ironically) has significant social costs — see #3 below for an example. Is there a form of pricing that allows for perfect discrimination (i.e., complete tailoring to individuals), in a way that consumers find acceptable? That would be the holy grail.
In this post, I will argue that the humble coupon, reborn in a high-tech form, could be the solution. Here’s why.
1. Coupons tap into shopper psychology. Customers love them.
Coupons, like sales, introduce unpredictability and rewards into shopping, which provides a tiny dopamine spike that gets us hooked. JC Penney’s recent misadventure in trying to eliminate sales and coupons provides an object lesson:
“It may be a decent deal to buy that item for $5. But for someone like me, who’s always looking for a sale or a coupon — seeing that something is marked down 20 percent off, then being able to hand over the coupon to save, it just entices me. It’s a rush.”
Some startups have exploited this to the hilt, introducing “gamification” into commerce. Shopkick is a prime example. I see this as a very important trend.
2. Coupons aren’t perceived as unfair.
Given the above, shoppers have at best a dim perception of coupons as a price discrimination mechanism. Even when they do, however, coupons aren’t perceived as unfair to nearly the same degree as listing different prices for different consumers, even if the result in either case is identical. 
3. Traditional coupons are not personalized.
While customers may have different reasons for liking coupons, from firms’ perspective the way in which traditional coupons aid price discrimination is pretty simple: by forcing customers to waste their time. Econ texts tend to lay it out bluntly. For example, R. Preston McAfee:
Individuals generally value their time at approximately their wages, so that people with low wages, who tend to be the most price-sensitive, also have the lowest value of time. … A thrifty shopper may be able to spend an hour sorting through the coupons in the newspaper and save $20 on a $200 shopping expedition … This is a good deal for a consumer who values time at less than $20 per hour, and a bad deal for the consumer that values time in excess of $20 per hour. Thus, relatively poor consumers choose to use coupons, which permits the seller to have a price cut that is approximately targeted at the more price-sensitive group.
Clearly, for this to be effective, coupon redemption must be deliberately made time-consuming.
To the extent that there is coupon personalization, it seems to be for changing shopper behavior (e.g., getting them to try out a new product) rather than a pricing mechanism. The NYT story from last year about Target targeting pregnant women falls into this category. That said, these different forms of personalization aren’t entirely distinct, which is a point I will return to in a later article.
4. The traditional model doesn’t work well any more.
Paper coupons have a limited future. As for digital coupons, there is a natural progression toward interfaces that make it easier to acquire and redeem them. In particular, as more shoppers start to pay using their phones in stores, I anticipate coupon redemption being integrated into payment apps, thus becoming almost frictionless.
An interesting side-effect of smartphone-based coupon redemption is that it gives the shopper more privacy, avoiding the awkwardness of pulling out coupons from a purse or wallet. This will further open up coupons to a wealthier demographic, making them even less effective at discriminating between wealthier shoppers and less affluent ones.
5. The coupon is being reborn in a data-driven, personalized form.
With behavioral profiling, companies can determine how much a consumer will pay for a product, and deliver coupons selectively so that each customer’s discount reflects what they are willing to pay. They key difference is what while in the past, customers decided whether or not to look for, collect, and use a coupon, in the new model companies will determine who gets which coupons.
In the extreme, coupons will be available for all purchases, and smart shopping software on our phones or browsers will automatically search, aggregate, manage, and redeem these coupons, showing coupon-adjusted prices when browsing for products. More realistically, the process won’t be completely frictionless, since that would lose the psychological benefit. Coupons will probably also merge with “rewards,” “points,” discounts, and various other incentives.
There have been rumblings of this shift here and there for a few years now, and it seems to be happening gradually. Google’s acquisition of Incentive Targeting a few months ago seems significant, and at the very least demonstrates that tech companies are eyeing this space as well, and not just retailers. As digital feudalism takes root, it could accelerate the trend of individualized shopping experiences.
In summary, personalized coupons offer a vehicle for realizing the full potential of data mining for commerce by tailoring prices in a way that consumers seem to find acceptable. Neither coupons nor price discrimination should be viewed in isolation — together with rewards and various other incentive schemes, they are part of the trend of individualized, data mining-driven commerce that’s here to stay.
 Since I’m eschewing some academic terminology in this post, here are a few references and points of clarification. My interest is in first-degree price discrimination. Any price discrimination requires market power; my assumption is that is the case in practice because competition is always imperfect, and we should expect quite a bit of first-degree price discrimination. The observed level is puzzlingly low.
The impact of technology on the ability to personalize prices is complex, and behavioral profiling is only one aspect. Technology also makes competition less perfect by allowing firms to customize products to a greater degree, so that there are no exact substitutes. Finally, technology hinders first-degree price discrimination to an extent by allowing consumers to compare prices between different retailers more easily. The interaction between these effects is analyzed in this paper.
Technology also increases the incentive to price discriminate. As production becomes more and more automated, marginal costs drop relative to fixed costs. In the extreme, digital goods have essentially zero marginal cost. When marginal production costs are low, firms will try to tailor prices since any sale above marginal cost increases profits.
My use of the terms overt and covert is rooted in the theory of price fairness in psychology and behavioral economics, and relates to the presentation of the transaction. While it is somewhat related to first- vs. second/third-degree price discrimination, it is better understood as a separate axis, one that is not captured by theories of rational firms and consumers.
 An exception is when non-coupon customers are made aware that others are getting a better deal. This happens, for example, when there is a prominent coupon-code form field in an online shopping checkout flow. See here for a study.
Thanks to Sebastian Gold for reviewing a draft, and to Justin Brickell for interesting conversations that led me to this line of thinking.
What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?
In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.
Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.
First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher that I know who is a U.S. citizen tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.
Homer et al’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.
It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.
It is not only people that are fingerprintable — so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, are have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.
Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.
Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.
If reidentification can cause leakage and make things messy, it can also clean things up. In a general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.
In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways.) On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
There is more, but I’ll stop here. The point is that these algorithms are everywhere.
If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.
Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.
My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?
Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the subjects in the datasets I study.
I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in the back of our minds as we continue the conversation on the ethics of reidentification.
Thanks to Vitaly Shmatikov for reviewing a draft.
In earlier posts about the privacy technologies course I taught at Princeton during Fall 2012, I described how I refuted privacy myths, and presented an annotated syllabus. In this concluding post I will offer some additional tidbits about the course.
Wiki. I referred to a Wiki a few times in my earlier post, and you might wonder what that was about. The course included an online Wiki discussion component, and this was in fact the centerpiece. Students were required to participate in the online discussion of the day’s readings before coming to class. The in-class discussion would use the Wiki discussion as a starting point.
The advantages of this approach are: 1. it gives the instructor a great degree of control in shaping the discussion of each paper, 2. the instructor can more closely monitor individual students’ progress 3. class discussion can focus on particularly tricky and/or contentious points, instead of rehashing the obvious.
Student projects. Students picked a variety of final projects for the class, and on the whole exceeded my expectations. Here are two very different projects, in brief.
Nora Taranto, a History of Science major, wrote a policy paper about the privacy implications of the shift to electronic medical records. Nora writes:
I wrote a paper about the privacy implications of patient-care institutions (in the United States) using electronic medical record (EMR) systems more and more frequently. This topic had particular relevance given the huge number of privacy breaches that occurred in 2012 alone. Meanwhile, there is a simultaneous criticism coming from care providers about the usability of such EMR systems. As such, many different communities—in the information privacy sphere, in the medical community, in the general public, etc.—have many different things to say. But, given the several privacy breaches that occurred within a couple of weeks in April 2012 and together implicated over a million individuals, concerns have been raised in particular about how secure EMR systems are. These concerns are especially worrisome given the federal government’s push for their adoption nationwide beginning in 2009 when the American Recovery and Reinvestment Act granting funds to hospitals explicitly for the purpose of EMR implementation.
So I looked into the benefits and costs of such systems, with a particular slant towards the privacy benefits/costs. Overall, these systems do have a number of protective mechanisms at their disposal, some preventative and others reactive. While these protective barriers are all necessary, they are not sufficient to guarantee the patient his or her privacy rights in the modern day. These protective mechanisms—authentication schemes, encryption, and data logs/anomaly-detection—need to be expanded and further developed to provide an adequate amount of protection for personal health information. While the government is, at the moment, encouraging the adoption of EMR systems for maximal penetration, medical institutions ought to use caution in considering which systems to implement and ought to hold themselves to a higher standard. Moreover, greater regulatory oversight of EMR systems on the market would help institutions maintain this cautious approach.
Abu Saparov, Ajay Roopakalu, and Raﬁ Shamim, also undergraduates, designed an implemented an alternative to centralized key distribution. They write:
Our project for the course was to create and implement a decentralized public key distribution protocol and show how it could be used. One of the initial goals of our project was to experience first-hand some of the things that made the design of a usable and useful privacy application so hard. Early on in the process, we decided to try to build some type of application that used cryptography to enhance the privacy of communication with friends. Some of the reasons that we chose this general topic were the fact that all of us had experience with network programming and that we thought some of the things that cryptography can achieve are uniquely cool. We were also somewhat motivated by the prospect of using our application to talk with each other and our other friends after we graduate. We eventually gravitated towards two ideas: (1) a completely peer-to-peer chat system that is encrypted from end-to-end, and (2) a “dumb” social network that allows users to share posts that only their friends (and not the server) can see. During the semester, our focus shifted to designing and implementing the underlying key distribution mechanism upon which these two systems could be built.
When we began to flesh out the designs for our two ideas, we realized that the act of retrieving a friend’s public cryptographic keys was the first challenge to solve. Certificate authorities are the most common way to obtain public keys, but require a large degree of trust to be placed in a small number of authorities. Web of Trust is another option, and is completely decentralized, but often proves difficult in practice because of the need for manual key signing. We decided to make our own decentralized protocol that exposes an easily usable API for clients to use in order to obtain public keys. Our protocol defines an overlay network that features regular nodes, as well as supernodes that are able to prove their trustworthiness, although the details of this are controllable through a policy delegate. The idea is for supernodes to share the task of remembering and verifying public keys through a majority vote of neighboring supernodes. Users running other nodes can ask the supernodes for a friend’s public key. In order to trick someone, an adversary would have to control over half of the supernodes from which a user requested a key. Our decision to go with an overlay network created a variety of issues such as synchronizing information between supernodes, being able to detect and report malicious supernodes, and getting new nodes incorporated into the network. These and the countless other design problems we faced definitely allowed us to appreciate the difficulty of writing a privacy application, but unfortunately, we were not fully able to test every element of our protocol and its implementation. After creating the protocol, we implemented small, bare-bones applications for our initial ideas of peer-to-peer chat and an encrypted social network.
Master’s students Chris Eubank, Marcela Melara, and Diego Perez-Botero did a project on mobile web tracking which, with some further work, turned into a research paper that Chris will speak about at W2SP tomorrow.
Finally, I’m happy to say that I will be discussing the syllabus and my experiences teaching this class at HotPETs this year, in Bloomington, IN, in July.
Last October I gave a talk titled “What Happened to the Crypto Dream?” where I looked at why crypto seems to have done little for personal privacy. The reaction from the audience (physical and online) was quite encouraging — not that everyone agreed, but they seemed to find it thought provoking — and several people asked me if I’d turn it into a paper. So when Prof. Alessandro Acquisti invited me to contribute an essay to the “On the Horizon” column in IEEE S&P magazine, I jumped at the chance, and suggested this topic.
While I’m not saying anything earth shaking, I do make a somewhat nuanced argument — I distinguish between “crypto for security” and “crypto for privacy,” and further subdivide the latter into a spectrum between what I call “Cypherpunk Crypto” and “Pragmatic Crypto.” I identify different practical impediments that apply to those two flavors (in the latter case, a complex of related factors), and lay out a few avenues for action that can help privacy-enhancing crypto move in a direction more relevant to practice.
I’m aware that this is a contentious topic, especially since some people feel that the time is ripe for a resurgence of the cypherpunk vision. I’m happy to hear your reactions.