What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?
In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.
Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.
First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher I know, a U.S. citizen, tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.
Homer et al.’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.
It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.
It is not only people that are fingerprintable — so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.
Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.
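To make that concrete, here is a minimal sketch of reidentification cast as classification: each known individual is a “class,” and an anonymous record is assigned to the class it most resembles. The margin test is a crude stand-in for the eccentricity heuristic we used in the Netflix paper; the data, names, and threshold are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts: feature -> weight)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def reidentify(anon_record, known_profiles, margin=0.05):
    """Treat each known individual as a class; output the best match,
    but only if it beats the runner-up by a clear margin."""
    scored = sorted(((cosine(anon_record, prof), name)
                     for name, prof in known_profiles.items()), reverse=True)
    (best, name), (second, _) = scored[0], scored[1]
    return name if best - second >= margin else None

# Toy data: movie-rating profiles keyed by movie ID.
profiles = {
    "alice": {"m1": 5, "m2": 1, "m7": 4},
    "bob":   {"m2": 5, "m3": 3, "m9": 2},
    "carol": {"m1": 1, "m7": 2, "m8": 5},
}
print(reidentify({"m1": 5, "m7": 4}, profiles))  # -> 'alice'
```

Nothing here cares that the “classes” are people; the same skeleton, scaled up, is a many-class classifier.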
Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.
If reidentification can cause leakage and make things messy, it can also clean things up. In its general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.
In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways.) On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
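The core of such a linker can be tiny. Here is a hedged sketch (not my actual code from either project) that matches entities across two databases by the overlap of their associated item sets: discography entries in the last.fm case, coauthor names in the arXiv/DBLP case.

```python
def jaccard(a, b):
    """Set overlap normalized by union size."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

def link(left, right, threshold=0.4):
    """Match each left-hand entity to its best right-hand candidate.

    left/right map entity name -> set of associated items (album titles,
    coauthor names, ...). Name equality is deliberately not used, since
    linking by name alone fails in too many interesting ways."""
    matches = {}
    for lname, litems in left.items():
        best = max(right, key=lambda rname: jaccard(litems, right[rname]))
        if jaccard(litems, right[best]) >= threshold:
            matches[lname] = best
    return matches

lastfm = {"Madonna": {"Like a Prayer", "Ray of Light", "Music"}}
wikipedia = {"Madonna (entertainer)": {"Like a Prayer", "Ray of Light", "Erotica"},
             "Madonna (album)": {"Madonna", "Everybody"}}
print(link(lastfm, wikipedia))  # -> {'Madonna': 'Madonna (entertainer)'}
```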
There is more, but I’ll stop here. The point is that these algorithms are everywhere.
If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.
Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.
My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?
Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the subjects in the datasets I study.
I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in the back of our minds as we continue the conversation on the ethics of reidentification.
Thanks to Vitaly Shmatikov for reviewing a draft.
In earlier posts about the privacy technologies course I taught at Princeton during Fall 2012, I described how I refuted privacy myths, and presented an annotated syllabus. In this concluding post I will offer some additional tidbits about the course.
Wiki. I referred to a Wiki a few times in my earlier post, and you might wonder what that was about. The course included an online Wiki discussion component, and this was in fact the centerpiece. Students were required to participate in the online discussion of the day’s readings before coming to class. The in-class discussion would use the Wiki discussion as a starting point.
The advantages of this approach are: (1) it gives the instructor a great degree of control in shaping the discussion of each paper; (2) the instructor can more closely monitor individual students’ progress; (3) class discussion can focus on particularly tricky and/or contentious points, instead of rehashing the obvious.
Student projects. Students picked a variety of final projects for the class, and on the whole exceeded my expectations. Here are two very different projects, in brief.
Nora Taranto, a History of Science major, wrote a policy paper about the privacy implications of the shift to electronic medical records. Nora writes:
I wrote a paper about the privacy implications of patient-care institutions (in the United States) using electronic medical record (EMR) systems more and more frequently. This topic had particular relevance given the huge number of privacy breaches that occurred in 2012 alone. Meanwhile, there is simultaneous criticism coming from care providers about the usability of such EMR systems. As such, many different communities—in the information privacy sphere, in the medical community, in the general public, etc.—have many different things to say. But, given the several privacy breaches that occurred within a couple of weeks in April 2012 and together implicated over a million individuals, concerns have been raised in particular about how secure EMR systems are. These concerns are especially worrisome given the federal government’s push for their adoption nationwide beginning in 2009, when the American Recovery and Reinvestment Act granted funds to hospitals explicitly for the purpose of EMR implementation.
So I looked into the benefits and costs of such systems, with a particular slant towards the privacy benefits/costs. Overall, these systems do have a number of protective mechanisms at their disposal, some preventative and others reactive. While these protective barriers are all necessary, they are not sufficient to guarantee the patient his or her privacy rights in the modern day. These protective mechanisms—authentication schemes, encryption, and data logs/anomaly-detection—need to be expanded and further developed to provide an adequate amount of protection for personal health information. While the government is, at the moment, encouraging the adoption of EMR systems for maximal penetration, medical institutions ought to use caution in considering which systems to implement and ought to hold themselves to a higher standard. Moreover, greater regulatory oversight of EMR systems on the market would help institutions maintain this cautious approach.
Abu Saparov, Ajay Roopakalu, and Rafi Shamim, also undergraduates, designed and implemented an alternative to centralized key distribution. They write:
Our project for the course was to create and implement a decentralized public key distribution protocol and show how it could be used. One of the initial goals of our project was to experience first-hand some of the things that made the design of a usable and useful privacy application so hard. Early on in the process, we decided to try to build some type of application that used cryptography to enhance the privacy of communication with friends. Some of the reasons that we chose this general topic were the fact that all of us had experience with network programming and that we thought some of the things that cryptography can achieve are uniquely cool. We were also somewhat motivated by the prospect of using our application to talk with each other and our other friends after we graduate. We eventually gravitated towards two ideas: (1) a completely peer-to-peer chat system that is encrypted from end-to-end, and (2) a “dumb” social network that allows users to share posts that only their friends (and not the server) can see. During the semester, our focus shifted to designing and implementing the underlying key distribution mechanism upon which these two systems could be built.
When we began to flesh out the designs for our two ideas, we realized that the act of retrieving a friend’s public cryptographic keys was the first challenge to solve. Certificate authorities are the most common way to obtain public keys, but require a large degree of trust to be placed in a small number of authorities. Web of Trust is another option, and is completely decentralized, but often proves difficult in practice because of the need for manual key signing. We decided to make our own decentralized protocol that exposes an easily usable API for clients to use in order to obtain public keys. Our protocol defines an overlay network that features regular nodes, as well as supernodes that are able to prove their trustworthiness, although the details of this are controllable through a policy delegate. The idea is for supernodes to share the task of remembering and verifying public keys through a majority vote of neighboring supernodes. Users running other nodes can ask the supernodes for a friend’s public key. In order to trick someone, an adversary would have to control over half of the supernodes from which a user requested a key. Our decision to go with an overlay network created a variety of issues such as synchronizing information between supernodes, being able to detect and report malicious supernodes, and getting new nodes incorporated into the network. These and the countless other design problems we faced definitely allowed us to appreciate the difficulty of writing a privacy application, but unfortunately, we were not fully able to test every element of our protocol and its implementation. After creating the protocol, we implemented small, bare-bones applications for our initial ideas of peer-to-peer chat and an encrypted social network.
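To give a flavor of the majority-vote idea they describe, here is a hypothetical sketch of the lookup step; it is my own simplification, not the students’ code, and query_supernode stands in for the real network call.

```python
import random
from collections import Counter

def fetch_key(user_id, supernodes, query_supernode, quorum=5):
    """Ask a random sample of supernodes for user_id's public key; accept
    an answer only if a strict majority of the queried nodes agree.

    query_supernode(node, user_id) -> key bytes, or None on failure.
    (All the real difficulty, such as trust, synchronization, and detecting
    malicious supernodes, hides behind this one call.)"""
    sample = random.sample(supernodes, min(quorum, len(supernodes)))
    answers = [k for k in (query_supernode(n, user_id) for n in sample)
               if k is not None]
    if not answers:
        return None
    key, votes = Counter(answers).most_common(1)[0]
    # To trick the user, an adversary must control more than half the sample.
    return key if votes > len(sample) / 2 else None
```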
Master’s students Chris Eubank, Marcela Melara, and Diego Perez-Botero did a project on mobile web tracking which, with some further work, turned into a research paper that Chris will speak about at W2SP tomorrow.
Finally, I’m happy to say that I will be discussing the syllabus and my experiences teaching this class at HotPETs this year, in Bloomington, IN, in July.
Last October I gave a talk titled “What Happened to the Crypto Dream?” where I looked at why crypto seems to have done little for personal privacy. The reaction from the audience (physical and online) was quite encouraging — not that everyone agreed, but they seemed to find it thought-provoking — and several people asked me if I’d turn it into a paper. So when Prof. Alessandro Acquisti invited me to contribute an essay to the “On the Horizon” column in IEEE S&P magazine, I jumped at the chance, and suggested this topic.
While I’m not saying anything earth shaking, I do make a somewhat nuanced argument — I distinguish between “crypto for security” and “crypto for privacy,” and further subdivide the latter into a spectrum between what I call “Cypherpunk Crypto” and “Pragmatic Crypto.” I identify different practical impediments that apply to those two flavors (in the latter case, a complex of related factors), and lay out a few avenues for action that can help privacy-enhancing crypto move in a direction more relevant to practice.
I’m aware that this is a contentious topic, especially since some people feel that the time is ripe for a resurgence of the cypherpunk vision. I’m happy to hear your reactions.
Last semester I taught a course on privacy technologies here at Princeton. Earlier I discussed how I refuted privacy myths that students brought into class. In this post I’d like to discuss the contents of the course. I hope it will be useful to other instructors who are interested in teaching this topic as well as for students undertaking self-study of privacy technologies. Beware: this post is quite long.
What should be taught in a class on privacy technologies? Before we answer that, let’s take a step back and ask, how does one go about figuring out what should be taught in any class?
I’ve seen two approaches. The traditional, default, overwhelmingly common approach is to think of it in terms of “covering content” without much consideration to what students are getting out of it. The content that’s deemed relevant is often determined by what the fashionable research areas happen to be, or historical accident, or some combination thereof.
A contrasting approach, promoted by authors like Bain, applies a laser focus on skills that students will acquire and how they will apply them later in life. On teaching orientation day at Princeton, our instructor, who clearly subscribed to this approach, had each professor describe what students would do in the class they are teaching, then wrote down only the verbs from these descriptions. The point was that our thinking had to be centered around skills that students would take home.
I prefer a middle ground. It should be apparent from my description of the traditional approach above that I’m not a fan. On the other hand, I have to wonder what skills our teaching coach would have suggested for a course on cosmology — avoiding falling into black holes? Alright, I’m exaggerating to make a point. The verbs in question are words like “synthesize” and “evaluate,” so there would be no particular difficulty in applying them to cosmology. But my point is that in a cosmology course, I’m not sure the instructor should start from these verbs.
Sometimes we want students to be exposed to knowledge primarily because it is beautiful, and being able to perceive that beauty inspires us, instills us with a love of further learning, and I dare say satisfies a fundamental need. To me a lot of the crypto “magic” that goes into privacy technologies falls into that category (not that it doesn’t have practical applications).
With that caveat, however, I agree with the emphasis on skills and life impact. I thought of my students primarily as developers of privacy technologies (and more generally, of technological systems that incorporate privacy considerations), but also as users and scholars of privacy technologies.
I organized the course into sections, a short introductory section followed by five sections that alternated in the level of math/technical depth. Every time we studied a technology, we also discussed its social/economic/political aspects. I had a great deal of discretion in guiding where the conversation around the papers went by giving students questions/prompts on the class Wiki. Let us now jump in. The italicized text is from the course page; the rest is my annotation.
Goals of this section: Why are we here? Who cares about privacy? What might the future look like?
- Dan Solove. Why Privacy Matters Even if You Have ‘Nothing to Hide’ (Chronicle)
- David Brin. The Transparent Society (WIRED, circa 1996, later expanded into a book)
In addition to helping flesh out the foundational assumptions of this course that I discussed in the previous post, pairing these opposing views with each other helped make the point that there are few absolutes in this class, that privacy scholars may disagree with each other, and that the instructor doesn’t necessarily agree with the viewpoints in the assigned reading, much less expects students to.
1. Cryptography: power and limitations
Goals. Travel back in time to the 80s and early 90s, understand the often-euphoric vision that many crypto pioneers and hobbyists had for the impact it would have. Understand how cryptographic building blocks were thought to be able to support this restructuring of society. Reason about why it didn’t happen.
Understand the motivations and mathematical underpinnings of the modern research on privacy-preserving computations. Experiment with various encryption tools, discover usability problems and other limitations of crypto.
- David Chaum. Security without Identification: Card Computers to make Big Brother Obsolete (1985)
- Steven Levy. Crypto Rebels (WIRED, 1993; later a 2001 book)
- Eric Hughes. A cypherpunk’s manifesto. (short essay, 1993.)
I think the Chaum paper is a phenomenal and underutilized resource for teaching. My goal was to really immerse students in an alternate reality where the legal underpinnings of commerce were replaced by cryptography, much as Chaum envisioned (and even going beyond that). I created a couple of e-commerce scenarios for Wiki discussion and had them reason about how various functions would be accomplished.
My own views on this topic are set forth in this talk (now a paper; coming soon). In general I aimed to shield students from my viewpoints, and saw my role as helping them discover (and be able to defend) their own. At least in this instance I succeeded. Some students took the position that the cypherpunk dream is just around the corner.
- The ‘Garbled Circuit Protocol’ (Yao’s theorem on secure two-party computation) and its implications (lecture)
This is one of the topics that sadly suffers from a lack of good expository material, so I instead lectured on it.
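For instructors in the same bind, the toy example below is the kind of expository material I had in mind: a single garbled AND gate. It is only a sketch (it omits point-and-permute, oblivious transfer, and everything else needed for actual security), but it shows the central trick of encrypting output-wire labels under pairs of input-wire labels.

```python
import os, random
from hashlib import sha256

def enc(k1, k2, msg):
    """Toy encryption: XOR msg with a pad derived from the label pair."""
    pad = sha256(k1 + k2).digest()
    return bytes(a ^ b for a, b in zip(pad, msg))

# Garbler: a random 32-byte label for each (wire, bit) pair.
lab = {w: (os.urandom(32), os.urandom(32)) for w in ("a", "b", "out")}

# Garbled table for AND: each row encrypts the appropriate output label
# under one of the four input-label pairs.
table = [enc(lab["a"][x], lab["b"][y], lab["out"][x & y])
         for x in (0, 1) for y in (0, 1)]
random.shuffle(table)  # hide which row corresponds to which inputs

# Evaluator: holds exactly one label per input wire (here a=1, b=0), so it
# can decrypt exactly one row; the other rows decrypt to garbage.
ka, kb = lab["a"][1], lab["b"][0]
decrypted = [enc(ka, kb, row) for row in table]  # XOR is its own inverse
out = next(d for d in decrypted if d in lab["out"])
print(out == lab["out"][0])  # True, since 1 AND 0 == 0
```

(The final membership check against lab["out"] is a demo convenience; a real construction tags rows so the evaluator can recognize the correct decryption without knowing the label set.)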
- Alma Whitten and Doug Tygar. Why Johnny Can’t Encrypt: A Usability Evaluation of PGP 5.0
- Nikita Borisov, Ian Goldberg, Eric Brewer. Off-the-Record Communication, or, Why Not To Use PGP
One of the exercises here was to install and use various crypto tools and rediscover the usability problems. The difficulties were even worse than I’d anticipated.
2. Data collection and data mining, economics of personal data, behavioral economics of privacy
Goals. Jump forward in time to the present day and immerse ourselves in the world of ubiquitous data collection and surveillance. Discover what kinds of data collection and data mining are going on, and why. Discuss how and why the conversation has shifted from Government surveillance to data collection by private companies in the last 20 years.
Theme: first-party data collection.
- New York Times. How Companies Learn Your Secrets
- Andrew Odlyzko. Privacy, Economics, and Price Discrimination on the Internet
Theme: third-party data collection.
- Julia Angwin. The Web’s New Gold Mine: Your Secrets (First in the Wall Street Journal’s What They Know series)
- Jonathan R. Mayer and John C. Mitchell. Third-Party Web Tracking: Policy and Technology
Theme: why companies act the way they do.
- Joseph Bonneau and Sören Preibusch. The Privacy Jungle: On the Market for Data Protection in Social Networks
- Bruce Schneier. How Security Companies Sucker Us With Lemons (WIRED)
Theme: why people act the way they do.
- Alessandro Acquisti and Jens Grossklags. What Can Behavioral Economics Teach Us About Privacy?
- Alessandro Acquisti. Privacy in Electronic Commerce and the Economics of Immediate Gratification
This section is rather self-explanatory. After the math-y flavor of the first section, this one has a good amount of economics, behavioral economics, and policy. One of the thought exercises was to project current trends into the future and imagine what ubiquitous tracking might lead to in five or ten years.
3. Anonymity and De-anonymization
Important note: communications anonymity (e.g., Tor) and data anonymity/de-anonymization (e.g., identifying people in digital databases) are technically very different, but we will discuss them together because they raise some of the same ethical questions. Also, Bitcoin lies somewhere in between the two.
- Roger Dingledine, Nick Mathewson, Paul Syverson. Tor: The Second-Generation Onion Router
- Satoshi Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System
Tor and Bitcoin (especially the latter) were the hardest but also the most rewarding parts of the class, both for them and for me. Together they took up 4 classes. Bitcoin is extremely challenging to teach because it is technically intricate, the ecosystem is rapidly changing, and a lot of the information is in random blog/forum posts.
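For what it’s worth, the piece of Bitcoin that is easiest to convey in class is the proof-of-work puzzle, which fits in a few lines. Here is a toy sketch; real mining double-hashes an 80-byte block header, and the header string and difficulty below are made up.

```python
from hashlib import sha256

def mine(header: bytes, difficulty_bits: int) -> int:
    """Find a nonce so that SHA-256(SHA-256(header || nonce)) has
    difficulty_bits leading zero bits. Expected work: 2**difficulty_bits
    hashes; verifying a claimed nonce takes just one."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        h = sha256(sha256(header + nonce.to_bytes(8, "little")).digest()).digest()
        if int.from_bytes(h, "big") < target:
            return nonce
        nonce += 1

print(mine(b"prev-hash|merkle-root|timestamp", difficulty_bits=20))
```

The asymmetry between finding a nonce and checking one is the whole point: it is what lets anyone audit a block while making it expensive to forge.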
In a way, I was betting on Bitcoin by deciding to teach it — if it had died with a whimper, their knowledge of it would be much less relevant. In general I think instructors should make such bets more often; most curricula are very conservative. I’m glad I did.
- Nils Homer et al. Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays
- [Optional] Arvind Narayanan, Elaine Shi, Benjamin I. P. Rubinstein. Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge
It was a challenge to figure out which deanonymization paper to assign. I went with the DNA one because I wanted them to see that deanonymization isn’t a fact about data, but a fact about the world. Another thing I liked about this paper is that students would have to extract its not-too-complex statistical methodology from the bioinformatics discussion in which it is embedded. This didn’t go as well as I’d hoped.
I’ve co-authored a few deanonymization papers, but they’re not very well written and/or are poorly suited for pedagogical purposes. The Kaggle paper is one exception, which I made optional.
- Paul Ohm. Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization
- [Optional] Jane Yakowitz Bambauer. Tragedy of the Data Commons
This is another pair of papers with opposing views. Since the latter paper is optional, knowing that most of them wouldn’t have read it, I used the Wiki prompts to raise many of the issues that the author raises.
4. Lightweight Privacy Technologies and New Approaches to Information Privacy
While cryptography is the mechanism of choice for cypherpunk privacy and anonymity tools like Tor, it is too heavy a weapon in other contexts like social networking. In the latter context, it’s not so much users deploying privacy tools to protect themselves against all-powerful adversaries but rather a service provider attempting to cater to a more nuanced understanding of privacy that users bring to the system. The goal of this section is to consider a diverse spectrum of ideas applicable to this latter scenario that have been proposed in recent years in the fields of CS, HCI, law, and more. The technologies here are “lightweight” in comparison to cryptographic tools like Tor.
- Scott Lederer, Jason Hong et al. Personal Privacy through Understanding and Action: Five Pitfalls for Designers
- Franziska Roesner et al. User-Driven Access Control: Rethinking Permission Granting in Modern Operating Systems
- Fred Stutzman and Woodrow Hartzog. Obscurity by Design: An Approach to Building Privacy into Social Media
- Woodrow Hartzog and Fred Stutzman. The Case for Online Obscurity
- Jerry Kang et al. Self-surveillance Privacy
- [Optional] Ryan Calo. Against Notice Skepticism In Privacy (And Elsewhere)
- Helen Nissenbaum. A Contextual Approach to Privacy Online
5. Purely technological approaches revisited
This final section doesn’t have a coherent theme (and I admitted as much in class). My goal with the first two papers was to contrast a privacy problem which seems amenable to a purely or primarily technological formulation and solution (statistical queries over databases of sensitive personal information) with one where such attempts have been less successful (the decentralized, own-your-data approach to social networking and e-commerce).
- Differential Privacy. (Lecture)
- Cynthia Dwork. Differential Privacy.
Differential privacy is another topic that is sorely lacking in expository material, especially from the point of view of students who’ve never done crypto before. So this was again a lecture.
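The one piece of it that does fit on a slide is the Laplace mechanism for a counting query. Here is a sketch of that textbook mechanism; the database and query are made up, and it ignores the floating-point subtleties a real implementation would need to address.

```python
import random

def laplace_count(database, predicate, epsilon):
    """Release a count with Laplace noise of scale sensitivity/epsilon.

    A counting query has sensitivity 1: adding or removing one person
    changes the true answer by at most 1, so noise of scale 1/epsilon
    suffices for epsilon-differential privacy."""
    true_count = sum(1 for row in database if predicate(row))
    # The difference of two Exp(epsilon) variables is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

db = [{"smoker": True}, {"smoker": False}, {"smoker": True}]
print(laplace_count(db, lambda r: r["smoker"], epsilon=0.5))  # 2 + noise
```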
- Arvind Narayanan et al. A Critical Look at Decentralized Personal Data Architectures
- John Perry Barlow. A Declaration of the Independence of Cyberspace (short essay, 1996)
- James Grimmelmann. Sealand, HavenCo, and the Rule of Law
These two essays aren’t directly related to privacy. One of the recurring threads in this course is the debate between purely technological and legal or other approaches to privacy; the theme here is to generalize it to a context broader than privacy. The Barlow essay asserts the exceptionalism of Cyberspace as an unregulable medium, whereas the Grimmelmann paper provides a much more nuanced view of the relationship between the law and new technological frontiers.
I’m making available the entire set of Wiki discussion prompts for the class (HTML/PDF). I consider this integral to the syllabus, for it shapes the discussion very significantly. I really hope other instructors and students find this useful as a teaching/study guide. For reference, each set of prompts (one set per class) took me about three hours to write on average.
There are many more things I want to share about this class: the major take-home ideas, the rationale for the Wiki discussion format, the feedback I got from students, a description of a couple of student projects, some thoughts on the sociology of different communities studying privacy and how that impacted the class, and finally, links to similar courses that are being taught elsewhere. I’ll probably close this series with a round-up post including as many of the above topics as I can.
Last semester I taught a course on privacy technologies. Since it was a seminar, the class was a small, self-selected group of very motivated students. Based on the feedback, it seems to have been a success; it was certainly quite personally gratifying for me. This is the first in a series of posts on what I learnt from teaching this course. In this post I will discuss some major misconceptions about privacy, how to refute them, and why it is important to do this right at the beginning of the course.
Privacy’s primary pitfalls
Instructors are often confronted with the task of breaking down faulty mental models that students bring into class before actual learning can happen. This is especially true of the topic at hand. Luckily, misconceptions about privacy are so pervasive in the media and among the general public that it wasn’t too hard to identify the most common ones before the start of the course. And it didn’t take much class discussion to confirm that my students weren’t somehow exempt from these beliefs.
One cluster of myths is about the supposed lack of importance of privacy. 1. “There is no privacy in the digital age.” This is the most common and perhaps the most grotesquely fallacious of the misconceptions; more on this below. 2. “No one cares about privacy any more” (variant: young people don’t care about privacy.) 3. “If you haven’t done anything wrong you have nothing to hide.”
A second cluster of fallacious beliefs is very common among computer scientists and comes from the tendency to reduce everything to a black-and-white technical problem. In this view, privacy maps directly to access control and cryptography is the main technical mechanism for achieving privacy. It’s a view in which the world is full of adversaries and there is no room for obscurity or nontechnical ways of improving privacy.
The first step in learning is to unlearn
Why is it important to spend time confronting faulty mental models? Why not simply teach the “right” ones? In my case, there was a particularly acute reason — to the extent that students believe that privacy is dead and that learning about privacy technologies is unimportant, they are not going to be invested in the class, which would be really bad. But even in the case of misconceptions that don’t lead to students doubting the fundamental premise of the class, there is a surprising reason why unlearning is important.
A famous experiment in the ’80s (I really, really recommend reading the linked text) demonstrated what we now know about the ineffectiveness of the “information transmission” model of teaching. The researchers interviewed students who had completed any of four introductory physics courses and determined that they hadn’t actually learned what had been taught, such as Newton’s laws of motion; instead they had just learned to pass the tests. When the researchers sat down with students to find out why, here’s what they found:
What they heard astonished them: many of the students still refused to give up their mistaken ideas about motion. Instead, they argued that the experiment they had just witnessed did not exactly apply to the law of motion in question; it was a special case, or it didn’t quite fit the mistaken theory or law that they held as true.
A special case! Ha. What’s going on here? Well, learning new facts is easy. On the other hand, updating mental models is so cognitively expensive that we go to absurd lengths to avoid doing so. The societal-scale analog of this extreme reluctance is well-illustrated by the history of science — we patched the Ptolemaic model of the Universe, with the Earth at the center, for over a millennium before we were forced to accept that the Copernican system fit observations better.
The instructor’s arsenal
The good news is that the instructor can utilize many effective strategies that fall under the umbrella of active learning. Ken Bain’s excellent book (which the preceding text describing the experiment is from) lays out a pattern in which the instructor creates an expectation failure, a situation in which existing mental models of reality will lead to faulty expectations. One of the prerequisites for this to work, according to the book, is to get students to care.
Bain argues that expectation failure, done right, can be so powerful that students might need emotional support to cope. Fortunately, this wasn’t necessary in my class, but I have no doubt of it based on my personal experiences. For instance, back when I was in high school, learning how the Internet actually worked and realizing that my intuitions about the network had to be discarded entirely was such a disturbing experience that I remember my feelings to this day.
Let’s look at an example of expectation failure in my privacy class. To refute the “privacy is dying” myth, I found it useful to talk about Fifty Shades of Grey — specifically, why it succeeded even though publishers initially passed on it. One answer seems to be that since it was first self-published as an e-book, it allowed readers to be discreet and avoid the stigma associated with the genre. (But following its runaway success in that form, the stigma disappeared, and it was released in paper form and flew off the shelves.)
The relative privacy of e-books from prying strangers is one of the many ways in which digital technology affords more privacy for specific activities. Confronting students with an observed phenomenon whose explanation involves a fact that seems starkly contrary to the popular narrative creates an expectation failure. Telling personal stories about how technology has either improved or eroded privacy, and eliciting such stories from students, gets them to care. Once this has been accomplished, it’s productive to get into a nuanced discussion of how to reconcile the two views with each other, different meanings of privacy (e.g., tracking of reading habits), how the Internet has affected each, and how society is adjusting to the changing technological landscape.
I’m quite new to teaching — this is only my second semester at Princeton — but it’s been exciting to internalize the fact that learning is something that can be studied scientifically and teaching is an activity that can vary dramatically in effectiveness. I’m looking forward to getting better at it and experimenting with different methods. In the next post I will share some thoughts on the content of my course and what I tried to get students to take home from it.
Thanks to Josh Hug for reviewing a draft.
This is the second in a series of posts with advice for computer science academic job candidates.
One shot, one opportunity
The philosopher Marshall Mathers once asked rhetorically, “Look, if you had one shot, or one opportunity / To seize everything you ever wanted in one moment / Would you capture it or just let it slip?”
He added, “Yo.” 
I don’t mean to imply that an academic position is everything you ever wanted, but it’s a pretty good life (although not for these reasons). Like it or not, it’s set up so that your career up until this point comes down to one moment. After years of hard work, your ability as a researcher will be judged primarily based on how you sell yourself in the fleeting span of an hour. Of course, you’ll (hopefully) give your talk at many places, but it’s going to be the same talk!
There’s a reason I’m saying this, and it’s not to stress you out even more. Rather, if at any point the level of preparedness that I suggest seems excessive or disproportionate, remember the wise words quoted above.
Public speaking is a performance
My first piece of advice is to read the book Confessions of a Public Speaker. As in, don’t even think about giving your job talk without having read it. You can read it in a sitting; putting it into practice will of course take longer. I cannot overstate the impact this book had on my talk (and my public speaking in general). There are probably other books that capture much of the wisdom in Confessions, and I’d love to hear other recommendations, but if you’re going to read one book it would have to be this one.
There are numerous very useful little details in the book, but it has one central idea that can be boiled down to the phrase “public speaking is a performance.” Job talks are even more of a performance than public speaking in general, since the audience is specifically there to judge you. This is a generative metaphor — it allows you to predict things about your job talk based on what you know about performing. Fully appreciating the metaphor will require reading the book, but here are two such predictions that might otherwise be surprising.
Your first priority is to entertain
Certainly you must both entertain and inform, but the point is that you don’t really have a shot at the latter if you fail at the former. Sitting in a lecture, as everyone who remembers their student days is surely aware, can be excruciating; it’s an extremely unnatural situation from an evolutionary perspective (again, read Confessions to appreciate why.) The chart below from the book What’s the Use of Lectures? shows students’ heart rate over time as they sat in a lecture. It’s only a drop of a few beats per minute, but it translates to an enormous difference in alertness.

If you don’t do anything different in your talk and simply present your material, your audience’s attention level will be greatly diminished by the half-hour mark, and by the end of your talk people will basically be comatose. Anything you can do to break the routine, linear, hyper-boring pattern of a lecture will help jolt the audience out of their stupor. (That includes asking questions — I usually asked two or three in my talk.) Otherwise they won’t be excited about you or remember much after the talk.
Rehearse, rehearse, rehearse
You may have heard “practice, practice, practice.” I’d rather cast it in the language of performance, as there are some subtle differences. For example, when people tell you to practice, they tell you not to overdo it because you’d lose your spontaneity. I disagree. In a rehearsal, everything is practiced down to the last detail. In fact, the apparently spontaneous things that I said in my talk were the most well-rehearsed parts.
Rehearsal should include videotaping yourself and watching it. Yes, it’s painful and majorly cringe-inducing, but it’s absolutely, absolutely essential. In addition to all the obvious facets of good presentation style that I won’t repeat, one of the subtle but important things you should watch for is nervous tics or other repetitive behaviors — almost everyone has one or more of those, and they can almost derail your talk by distracting your audience.
The reason rehearsal makes such a huge difference is that when you’re delivering a rehearsed talk, your every word and gesture is subconscious, freeing up your mental bandwidth for observing and reacting in real-time to the facial expressions of your audience. There are never more than 40-50 people in these talks, a small enough number that you can instantly notice if someone looks confused, skeptical, or bored. But this won’t be possible if you have to think through your slides instead. The reduction in cognitive load also minimizes the chance of “hitting the wall,” a phenomenon of sudden mental fatigue that’s a serious danger in long-ish talks and can leave you helpless.
Let me close with an example of a little theatrical thing I did that shows the value of rehearsal and the performance metaphor. One of the goals in my location privacy project is to minimize smartphone power consumption. When I got to that part, I’d say, “those of you with Android phones know how bad the battery life is. In fact I usually carry a spare battery around… actually, I think I have it on me.” Then I’d pull a smartphone battery out of my jacket pocket with a bit of a dramatic touch. Somehow the use of a physical prop seemed to reframe their thinking from “yet another academic paper” to “solving a real problem.” It would also usually elicit a laugh and elevate their attention level.
There is so much more to say about job talks, not to mention other aspects of the job interview. I might do follow-up posts on a mathematical model of audience behavior and/or an explanation of why slide transitions are (by far) the most important part of your slides.
 This post was written while listening to Lose Yourself in a loop.
 If it needs to be said, I have no stake in the book, financial or otherwise.
 I hasten to add that teaching is very different from public speaking and is emphatically not a performance.
In my previous article I pointed out that online price discrimination is suspiciously absent in directly observable form, even though covert price discrimination is everywhere. Now let’s talk about why that might be.
By “covert” I don’t mean that the firm is trying to keep price discrimination a secret. Rather, I mean that the differential treatment isn’t made explicit — e.g., the price isn’t based directly on a customer attribute — thereby avoiding triggering the perception of unfairness or discrimination. A common example is selective distribution of coupons instead of listing different prices. Such discounting may be publicized, but it is still covert.
The perception of fairness
The perception of fairness or unfairness, then, is at the heart of what’s going on. Going back to the WSJ piece, I found it interesting to see the reaction of the customer to whom Staples quoted $1.50 more for a stapler based on her ZIP code: “How can they get away with that?” she asks. To which my initial reaction was, “Get away with what, exactly? Supply and demand? Econ 101?”
Even though some of us might not feel the same outrage, I think all of us share at least a vague sense of unease about overt price discrimination. So I decided to dig deeper into the literature in psychology, marketing, and behavioral economics on the topic of price fairness and understand where this perception comes from. What I found surprised me.
First, the fairness heuristic is quite elaborate and complex. In a vast literature spanning several decades, early work such as the “principle of dual entitlement” by Kahneman and coauthors established some basics. Quoting Anderson and Simester: “This theory argues that customers have perceived fairness levels for both firm profits and retail prices. Although firms are entitled to earn a fair profit, customers are also entitled to a fair price. Deviations from a fair price can be justified only by the firm’s need to maintain a fair profit. According to this argument, it is fair for retailers to raise the price of snow shovels if the wholesale price increases, but it is not fair to do so if a snowstorm leads to excess demand.”
Much later work has added to and refined that model. A particularly impressive and highly cited 2004 paper reviews the literature and proposes an elaborate framework with four different classes of inputs to explain how people decide if pricing is fair or unfair in various situations. Some of the findings are quite surprising. For example, in the case of differential pricing to the buyer’s disadvantage, “trust in the seller has a U-shaped effect on price fairness perceptions.”
The illusion of fairness
Sounds like we have a well-honed and sophisticated decision procedure, then? Quite the opposite, actually. The fairness heuristic seems to be rather fragile, even if complex.
Let’s start with an example. Andrew Odlyzko, in his brilliant essay on price discrimination — all the more impressive for the fact that it was published back in 2003 — has this to say about Coca Cola’s ill-fated plans for price-adjusting vending machines: “In retrospect, Coca Cola’s main problem was that news coverage always referred to its work as leading to vending machines that would raise prices in warm weather. Had it managed to control publicity and present its work as leading to machines that would lower prices in cold weather, it might have avoided the entire controversy.”
We know how to explain the public’s reaction to the Coca Cola announcement using behavioral economics — given the way it was presented (or framed), customers took the lower price as the “reference price,” and the price increase seemed unfair, whereas Odlyzko’s suggested framing would anchor the higher price as the reference price. Of course, just because we can explain how the fairness heuristic works doesn’t make it logical or consistent, let alone properly grounded in social justice.
More generally, every aspect of our mental price fairness assessment heuristic seems similarly vulnerable to hijacking by tweaking the presentation of the transaction without changing the essence of price discrimination. Companies have of course gotten wise to this; there’s even academic literature on it. One of the techniques proposed in this paper is “reference group signaling” — getting a customer to change the set of other customers to whom they mentally compare themselves. 
The perception of fairness, then, can be more properly called the illusion of fairness.
The fragility of the fairness heuristic becomes less surprising considering that we apparently share it with other primates. This hilarious clip from a TED talk shows a capuchin monkey reacting poorly, to put it mildly, to differential treatment in a monkey-commerce setting (although the jury may still be out on the significance of this experiment). If our reaction to pricing schemes is partly or largely due to brain circuitry that evolved millions of years ago, we shouldn’t expect it to fare well when faced with the complexities of modern business.
Given that the prime impediment to pervasive online price discrimination is a moral principle that is fickle and easily circumventable, one can expect companies to do exactly that, since they can reap most of the benefits of price discrimination without the negative PR. Indeed, it is my belief that more covert price discrimination is going on than is generally recognized, and that it is accelerating due to some technological developments.
This is a problem because price discrimination does raise ethical concerns, and these concerns are every bit as significant when it is covert.  However, since it is much less transparent, there’s less of an opportunity for public debate.
There are two directions in which I want to take this series of articles from this point: first a look at how new technology is enabling powerful forms of tailoring and covert price discrimination, and second, a discussion of what can be done to make price discrimination more transparent and how to have an informed policy discussion about its benefits and dangers.
 I had the pleasure of sitting next to Professor Odlyzko at a conference dinner once, and I expressed my admiration of the prescience of his article. He replied that he’d worked it all out in his head circa 1996 but took a few years to put it down on paper. I could only stare at him wordlessly.
 I’m struck by the similarities between price fairness perceptions and privacy perceptions. The aforementioned 2004 price fairness framework can be seen as serving a roughly analogous function to contextual integrity, which is (in part) a theory of consumer privacy expectations. Both these theories are the result of “reverse engineering,” if you will, of the complex mental models in their respective domains using empirical behavioral evidence. Continuing the analogy, privacy expectations are also fragile, highly susceptible to framing, and liable to be exploited by companies. Acquisti and Grossklags, among others, have done some excellent empirical work on this.
 In fact, crude ways of making customers reveal their price sensitivity lead to a much higher social cost than overt price discrimination. I will take this up in more detail in a future post.
The mystery about online price discrimination is why so little of it seems to be happening.
Consumer advocates and journalists, among others, have been trying to find smoking-gun evidence of price discrimination — the overt kind where different customers are charged different prices for identical products based on how much they are willing to pay. (By contrast, examples of covert or concealed price discrimination abound; see, for example, my 2011 article.) Back in 2000, Amazon ran a short-lived experiment in which new and regular users saw different prices for DVDs. But that remains essentially the only example.
This should be surprising. Tailoring prices to individuals is far more technically feasible online than offline, since shoppers are either identified or at least have loads of behavioral data associated with their pseudonymous cookies. The online advertising industry claims that this is highly effective for targeting ads; estimating consumers’ willingness to pay shouldn’t be much harder. Clearly, price discrimination has benefits to firms engaging in it by allowing them to capture more of the “consumer surplus.” (Whether or not it is beneficial to consumers is a more controversial question that I will defer to a future post.) In fact, based on technical feasibility and economic benefits, one might expect the practice to be pervasive.
The evidence (or lack thereof)
A study out of Spain last year took a comprehensive look at online merchants; it is by far the most thorough analysis of its kind. The researchers created two “personas” with different browsing histories — one visited discount sites, while the other visited sites for luxury products. Each persona then browsed 200 e-commerce sites as well as search engines to see if they were treated differently. Here’s what the authors found:
- There is evidence for search discrimination or steering where the high- and low-income personas are shown ads for high-end and low-end products respectively. In my opinion, the line between this practice and plain old behavioral advertising is very, very slim. 
- There is no evidence for price discrimination based on personas/browsing histories.
- Three of the 200 retailers, including Staples, varied prices based on the user’s location, but not necessarily in a way that can’t be explained by costs of doing business.
- Visitors coming from one particular deals site (nextag.com) saw lower prices at various retailers. (Discounting and “deals” are very common forms of concealed price discrimination.)
A new investigation by the Wall Street Journal analyzes Staples in more detail. While the Spain study found geographic variation in prices, the WSJ study goes further and shows a strong correlation between lower prices and consumers’ ability to drive to competitors’ stores, which is an indicator of willingness to pay. I’m not 100% convinced that they’ve ruled out alternative hypotheses, but it does seem plausible that Staples’ behavior constitutes actual price discrimination, even though geography is a far cry from utilizing behavioral data about individuals.
The WSJ piece also found websites that offer discounts to mobile users, as well as location-dependent pricing on Lowe’s and Home Depot’s websites, though with little evidence that either is based on anything but costs of doing business.
So there we have it. Both studies are very thorough, and I commend the authors, but I consider their results to be mostly negative — very few companies are varying prices at all and none are utilizing anywhere near the full extent of data available about users. Other price discrimination controversies include steering by Orbitz and a hastily-retracted announcement by Coca Cola for vending machines that would tailor prices to demand. Neither company charged or planned to charge different prices for the same product based on who the consumer was.
In short, despite all the hubbub, I find overt price discrimination conspicuous by its absence. In a follow-up post I will propose an explanation for the mystery and see what we can learn from it.
 This is an automatic consequence of collaborative recommendation, which suggests products to users based on what similar users have clicked on/purchased in the past. It does not require that any explicit inference of the consumer’s level of affluence be made by the system. In other words, steering, bubbling, etc. are inherent features of the collaborative filtering algorithms which drive personalization, recommendation and information retrieval on the Internet. This fact greatly complicates attempts to define, detect or regulate unfair discrimination online.
Thanks to Aleecia McDonald for reviewing a draft.
As an academic who’s spent time in the startup world, I see strong similarities between the nature of a scientific research project and the nature of a startup. This boils down to the fact that most research projects fail (in a sense that I’ll describe), and even among the successful projects the variance is extremely high — most of the impact is concentrated in a few big winners.
Of course, research projects are clearly unlike startups in some important ways: in research you don’t get to capture the economic benefit of your work; your personal gain from success is not money but academic reputation (unless you commercialize your research and start an actual startup, but that’s not what this post is about at all.) The potential personal downside is also lower for various reasons. But while the differences are obvious, the similarities call for some analysis.
I hope this post is useful to grad students in particular in acquiring a long-term vision for how to approach their research and how to maximize the odds of success. But perhaps others including non-researchers will also find something useful here. There are many aspects of research that may appear confusing or pathological, and at least some of them can be better understood by focusing on the high variance in research impact.
1. Most research projects fail.
To me, publication alone does not constitute success; rather, the goal of a research project is to impact the world, either directly or by influencing future research. Under this definition, the vast majority of research ideas, even if published, are forgotten in a few years. Citation counts estimate impact more accurately, but I think they still significantly underestimate the skew.
The fact that most research projects don’t make a meaningful lasting impact is OK — just as the fact that most startups fail is not an indictment of entrepreneurship.
A researcher might choose to take a self-interested view and not care about impact, but even in this view, merely aiming to get papers published is not a good long-term strategy. For example, during my recent interview tour, I got a glimpse into how candidates are evaluated, and I don’t think someone with a slew of meaningless publications would have gotten very far. 
2. Grad students: diversify your portfolio!
Given that failure is likely (and for reasons you can’t necessarily control), spending your whole Ph.D. trying to crack one hard problem is a highly risky strategy. Instead, you should work on multiple projects during your Ph.D., at least at the beginning. This can be either sequential or parallel; the former is more similar to the startup paradigm (“fail-fast”).
I achieved diversity by accident. Halfway through my Ph.D. there were at least half a dozen disparate research topics where I’d made some headway (some publications, some works in progress, some promising ideas). Although I felt I was directionless, this turned out to be the right approach in retrospect. I caught a lucky break on one of them — anonymity in sanitized databases — because of the Netflix Prize dataset, and from then on I doubled down to focus on deanonymization. This breadth-then-depth approach paid off.
3. Go for the big hits.
Paul Graham’s fascinating essay Black Swan Farming is about how skewed the returns are in early-stage startup investing. Just two of the several hundred companies that YCombinator has funded are responsible for 75% of the returns, and in each batch one company outshines all the rest.
The returns from research aren’t quite as skewed, but they’re skewed enough to be highly counterintuitive. This means researchers must explicitly account for the skew in selecting problems to work on. Following one’s intuition and/or the crowd is likely to lead to a mediocre career filled with incremental, marginally publishable results. The goal is to do something that’s not just new and interesting, but which people will remember in ten years, and the latter can’t necessarily be predicted based on the amount of buzz a problem is generating in the community right now. Breakthroughs often come from unsexy problems (more on that below).
There’s a bit of a tension between going for the hits and diversifying your portfolio. If you work on too few projects, you incur the risk that none of them will pan out. If you work on too many, you spread yourself too thin, the quality of each one suffers, and the chance that at least one of them will be a big hit drops. Everyone must find their own sweet spot. One piece of advice given to junior professors is to “learn to say no.”
4. Find good ideas that look like bad ideas.
How do you predict if an idea you have is likely to lead to success, especially a big one? Again let’s turn to Paul Graham in Black Swan Farming:
“the best startup ideas seem at first like bad ideas. … if a good idea were obviously good, someone else would already have done it. So the most successful founders tend to work on ideas that few beside them realize are good.”
Something very similar is true in research. There are some problems that everyone realizes are important. If you want to solve such a problem, you have to be smarter than most others working on it and be at least a little bit lucky. Craig Gentry, for example, invented Fully Homomorphic Encryption mostly by being very, very smart.
Then there are research problems that are analogous to Graham’s good ideas that initially look bad. These fall into two categories: (1) research problems that no one has realized are important, and (2) problems that everyone considers prohibitively difficult but which turn out to have a back door.
If you feel you are in a position to take on obviously important problems, more power to you. I try to work on problems that everyone seems to think are bad ideas (either unimportant or too difficult), but where I have some “unfair advantage” that leads me to think otherwise. Of course, a lot of the time they are right, but sometimes they are not. Let me give two examples.
I consider Adnostic (online behavioral advertising without tracking) to be moderately successful: it has had an impact on other research in the area, as well as in policy circles as an existence proof of behavioral-advertising-with-privacy. Now, my coauthors started working on it before I joined them, so I can take none of the credit for problem selection. But it’s a good illustration of the principle. The main reason they decided this problem was important was that privacy advocates were up in arms about online tracking. Almost no one in the computer science community was studying the topic, because they felt that simply blocking trackers was an adequate solution. So this was a case of picking a problem that people didn’t realize was important. Three years later it’s become a very crowded research space.
Another example is my work with Shmatikov on deanonymizing social networks by finding a matching between the nodes of two social graphs. Most people I talked to at the time thought this was impossible — after all, it’s a much harder version of graph isomorphism, and we’re talking about graphs with millions of nodes. Here’s the catch: people intuitively think graph isomorphism is “hard,” but it is in fact not known to be NP-complete, and on real-world graphs it is embarrassingly easy. We knew this, and even though the social network matching problem is harder than graph isomorphism, we thought it was still doable. In the end it took months of work, but fortunately it was just within the realm of possibility.
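Stripped of the robustness machinery in the actual paper, the algorithmic core is a propagation loop: start from a few known “seed” correspondences and repeatedly match node pairs whose already-mapped neighborhoods line up. Here is a loose sketch; the thresholds are arbitrary, and the real algorithm scores candidates far more carefully.

```python
def propagate(g1, g2, seeds):
    """Grow a node mapping between two graphs from seed matches.

    g1, g2: dicts mapping node -> set of neighbor nodes.
    seeds: known g1-node -> g2-node correspondences."""
    mapping = dict(seeds)
    changed = True
    while changed:
        changed = False
        for u in g1:
            if u in mapping:
                continue
            # Images (in g2) of u's already-mapped neighbors.
            images = {mapping[n] for n in g1[u] if n in mapping}
            taken = set(mapping.values())
            # Score each unmatched g2 node by neighborhood agreement.
            scores = {v: len(images & g2[v]) for v in g2 if v not in taken}
            if not scores:
                continue
            ranked = sorted(scores.values(), reverse=True)
            best = max(scores, key=scores.get)
            runner_up = ranked[1] if len(ranked) > 1 else 0
            # Accept only clear, well-supported matches.
            if scores[best] >= 2 and scores[best] > runner_up:
                mapping[u] = best
                changed = True
    return mapping
```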
5. Most researchers are known for only one or two things.
Let me end with an interesting side effect of the high-skew theory: a successful researcher may have worked on many successful projects during their career, but the top one or two of those will likely be far better known than the rest. This seems to be borne out empirically, and it is a source of much annoyance for many researchers to be pigeonholed as “the person who did X.” Let’s take Ron Rivest, who’s been prolific for several decades not just in cryptography but also in algorithms and lately in voting. Most computer scientists will recall that he’s the R in RSA, but knowledge of his work drops off sharply after that. This is also reflected in the citation counts (the first entry is a textbook, not a research paper).
In summary, if you’re a researcher, think carefully about which projects to work on and what the individual and overall chances of success are. And if you’re someone who’s skeptical about academia because your friend who dropped out of a Ph.D. after their project failed convinced you that all research is useless, I hope this post got you to think twice.
I may do a follow-up post examining whether ideas are as valuable as they are held to be in the research community, or whether research ideas are more similar to startup ideas in that it’s really execution and selling that lead to success.
 For example, a quarter of my papers are responsible for over 80% of my citations.
 That said, I will get a much better idea in the next few months from the other side of the table :)
 Specifically, it undermines the “we can’t stop tracking because it would kill our business model” argument that companies love to make when faced with pressure from privacy advocates and regulators.
 To be clear, my point is that Rivest’s citation counts drop off relative to his most well-known works.
Thanks to Joe Bonneau for comments on a draft.