Posts tagged ‘research’
As an academic who’s spent time in the startup world, I see strong similarities between the nature of a scientific research project and the nature of a startup. This boils down to the fact that most research projects fail (in a sense that I’ll describe), and even among the successful projects the variance is extremely high: most of the impact is concentrated in a few big winners.
Of course, research projects are clearly unlike startups in some important ways: in research you don’t get to capture the economic benefit of your work; your personal gain from success is not money but academic reputation (unless you commercialize your research and start an actual startup, but that’s not what this post is about at all). The potential personal downside is also lower for various reasons. But while the differences are obvious, the similarities call for some analysis.
I hope this post is useful to grad students in particular in acquiring a long-term vision for how to approach their research and how to maximize the odds of success. But perhaps others including non-researchers will also find something useful here. There are many aspects of research that may appear confusing or pathological, and at least some of them can be better understood by focusing on the high variance in research impact.
1. Most research projects fail.
To me, publication alone does not constitute success; rather, the goal of a research project is to impact the world, either directly or by influencing future research. Under this definition, the vast majority of research ideas, even if published, are forgotten in a few years. Citation counts estimate impact more accurately, but I think they still significantly underestimate the skew.
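One rough way to put a number on this kind of skew is to ask what share of all citations comes from the most-cited fraction of papers. The sketch below is only an illustration: the function name `top_share` and the citation counts are made up, not real data.

```python
def top_share(citations, fraction):
    """Share of total citations held by the top `fraction` of papers."""
    ranked = sorted(citations, reverse=True)
    # Always count at least one paper, even for tiny fractions.
    k = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical citation counts for eight papers (invented for illustration).
made_up_counts = [900, 400, 60, 40, 30, 25, 20, 15]
print(f"Top quarter holds {top_share(made_up_counts, 0.25):.0%} of citations")
```

A single summary like this still hides what happens to papers with no citations at all, which is one reason citation counts may understate the skew.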
The fact that most research projects don’t make a meaningful lasting impact is OK — just as the fact that most startups fail is not an indictment of entrepreneurship.
A researcher might choose to take a self-interested view and not care about impact, but even in this view, merely aiming to get papers published is not a good long-term strategy. For example, during my recent interview tour, I got a glimpse into how candidates are evaluated, and I don’t think someone with a slew of meaningless publications would have gotten very far. 
2. Grad students: diversify your portfolio!
Given that failure is likely (and for reasons you can’t necessarily control), spending your whole Ph.D. trying to crack one hard problem is a highly risky strategy. Instead, you should work on multiple projects during your Ph.D., at least at the beginning. This can be either sequential or parallel; the former is more similar to the startup paradigm (“fail-fast”).
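The arithmetic behind diversification is simple if we assume, purely for illustration, that each project succeeds independently with the same probability `p`. Then the chance that at least one of `n` projects succeeds is 1 - (1 - p)^n. The value p = 0.1 below is an assumption, not an estimate of real success rates.

```python
def p_at_least_one_hit(p: float, n: int) -> float:
    """Chance that at least one of n independent projects succeeds."""
    return 1.0 - (1.0 - p) ** n


# Assumed per-project success probability of 10% (illustrative only).
for n in (1, 3, 6):
    print(n, round(p_at_least_one_hit(0.1, n), 3))
# prints:
# 1 0.1
# 3 0.271
# 6 0.469
```

Real projects are neither independent nor equally promising, so this is a back-of-the-envelope picture rather than a model to optimize.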
I achieved diversity by accident. Halfway through my Ph.D. there were at least half a dozen disparate research topics where I’d made some headway (some publications, some works in progress, some promising ideas). Although I felt I was directionless, this turned out to be the right approach in retrospect. I caught a lucky break on one of them — anonymity in sanitized databases — because of the Netflix Prize dataset, and from then on I doubled down to focus on deanonymization. This breadth-then-depth approach paid off.
3. Go for the big hits.
Paul Graham’s fascinating essay Black Swan Farming is about how skewed the returns are in early-stage startup investing. Just two of the several hundred companies that YCombinator has funded are responsible for 75% of the returns, and in each batch one company outshines all the rest.
The returns from research aren’t quite as skewed, but they’re skewed enough to be highly counterintuitive. This means researchers must explicitly account for the skew in selecting problems to work on. Following one’s intuition and/or the crowd is likely to lead to a mediocre career filled with incremental, marginally publishable results. The goal is to do something that’s not just new and interesting, but which people will remember in ten years, and the latter can’t necessarily be predicted based on the amount of buzz a problem is generating in the community right now. Breakthroughs often come from unsexy problems (more on that below).
There’s a bit of a tension between going for the hits and diversifying your portfolio. If you work on too few projects, you incur the risk that none of them will pan out. If you work on too many, you spread yourself too thin, the quality of each one suffers, and the chance that at least one of them will be a big hit goes down. Everyone must find their own sweet spot. One piece of advice given to junior professors is to “learn to say no.”
4. Find good ideas that look like bad ideas.
How do you predict if an idea you have is likely to lead to success, especially a big one? Again let’s turn to Paul Graham in Black Swan Farming:
“the best startup ideas seem at first like bad ideas. … if a good idea were obviously good, someone else would already have done it. So the most successful founders tend to work on ideas that few beside them realize are good.”
Something very similar is true in research. There are some problems that everyone realizes are important. If you want to solve such a problem, you have to be smarter than most others working on it and be at least a little bit lucky. Craig Gentry, for example, invented Fully Homomorphic Encryption mostly by being very, very smart.
Then there are research problems that are analogous to Graham’s good ideas that initially look bad. These fall into two categories: (1) research problems that no one has realized are important, and (2) problems that everyone considers prohibitively difficult but which turn out to have a back door.
If you feel you are in a position to take on obviously important problems, more power to you. I try to work on problems that everyone seems to think are bad ideas (either unimportant or too difficult), but where I have some “unfair advantage” that leads me to think otherwise. Of course, a lot of the time they are right, but sometimes they are not. Let me give two examples.
I consider Adnostic (online behavioral advertising without tracking) to be moderately successful: it has had an impact on other research in the area, as well as in policy circles as an existence proof of behavioral-advertising-with-privacy. Now, my coauthors started working on it before I joined them, so I can take none of the credit for problem selection. But it’s a good illustration of the principle. The main reason they decided this problem was important was that privacy advocates were up in arms about online tracking. Almost no one in the computer science community was studying the topic, because they felt that simply blocking trackers was an adequate solution. So this was a case of picking a problem that people didn’t realize was important. Three years later it’s become a very crowded research space.
Another example is my work with Shmatikov on deanonymizing social networks by finding a matching between the nodes of two social graphs. Most people I talked to at the time thought this was impossible. After all, it’s a much harder version of graph isomorphism, and we’re talking about graphs with millions of nodes. Here’s the catch: people intuitively think graph isomorphism is “hard,” but it is in fact not known to be NP-complete, and on real-world graphs it is embarrassingly easy. We knew this, and even though the social network matching problem is harder than graph isomorphism, we thought it was still doable. In the end it took months of work, but fortunately it was just within the realm of possibility.
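To give a feel for why matching real-world graphs can be tractable, here is a toy sketch of one simple heuristic: start from a few known node pairs and greedily extend the mapping using shared neighbors. This is an illustrative assumption, not the method used in the work described above; the function name `propagate_matching`, the adjacency-dict representation, and the tiny example graphs are all made up.

```python
def propagate_matching(g1, g2, seeds):
    """Greedily extend a seed mapping from nodes of g1 to nodes of g2.

    g1, g2: dicts mapping node -> set of neighbor nodes (undirected).
    seeds:  dict of already-known pairs {g1_node: g2_node}.
    """
    mapping = dict(seeds)
    used = set(mapping.values())
    changed = True
    while changed:
        changed = False
        for u in g1:
            if u in mapping:
                continue
            # Where do u's already-mapped neighbors land in g2?
            images = {mapping[v] for v in g1[u] if v in mapping}
            # Score each unused g2 node by how many of those images
            # are among its own neighbors.
            scores = {c: len(images & g2[c]) for c in g2 if c not in used}
            if not scores:
                continue
            ranked = sorted(scores.values(), reverse=True)
            best = ranked[0]
            # Accept only a strictly unique, nonzero best score;
            # ambiguous nodes are retried in a later pass.
            if best > 0 and (len(ranked) == 1 or ranked[1] < best):
                candidate = max(scores, key=scores.get)
                mapping[u] = candidate
                used.add(candidate)
                changed = True
    return mapping


# Two copies of the same small graph under different labels:
# a triangle (a, b, c) with a tail node d attached to c.
g1 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
g2 = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(propagate_matching(g1, g2, {"a": 1, "b": 2}))
# prints {'a': 1, 'b': 2, 'c': 3, 'd': 4}
```

The intuition is that real social graphs are far from symmetric, so a node’s neighborhood tends to pin down its identity quickly. A toy like this ignores noise, missing edges, and scale, which is where the real difficulty lies.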
5. Most researchers are known for only one or two things.
Let me end with an interesting side effect of the high-skew theory: a successful researcher may have worked on many successful projects during their career, but the top one or two of those will likely be far better known than the rest. This seems to be borne out empirically, and it is a source of much annoyance for many researchers to be pigeonholed as “the person who did X.” Let’s take Ron Rivest, who’s been prolific for several decades not just in cryptography but also in algorithms and lately in voting. Most computer scientists will recall that he’s the R in RSA, but knowledge of his work drops off sharply after that. This is also reflected in the citation counts (the first entry is a textbook, not a research paper).
In summary, if you’re a researcher, think carefully about which projects to work on and what the individual and overall chances of success are. And if you’re someone who’s skeptical about academia because your friend who dropped out of a Ph.D. after their project failed convinced you that all research is useless, I hope this post got you to think twice.
I may do a follow-up post examining whether ideas are as valuable as they are held to be in the research community, or whether research ideas are more similar to startup ideas in that it’s really execution and selling that lead to success.
 For example, a quarter of my papers are responsible for over 80% of my citations.
 That said, I will get a much better idea in the next few months from the other side of the table :)
 Specifically, it undermines the “we can’t stop tracking because it would kill our business model” argument that companies love to make when faced with pressure from privacy advocates and regulators.
 To be clear, my point is that Rivest’s citation counts drop off relative to his most well-known works.
Thanks to Joe Bonneau for comments on a draft.
Last week I participated in the Web Privacy Measurement conference at Berkeley. It was a unique event because the community is quite new and this was our very first gathering. The WSJ Data Transparency hackathon is closely related; the Berkeley conference can be thought of as an academic counterpart. So it was doubly fascinating for me — both for the content and because of my interest in the sociology of research communities.
A year ago I explained that there is an information asymmetry when it comes to online privacy, leading to a “market for lemons.” The asymmetry exists for two main reasons: one is that companies don’t disclose what data they collect about you and what they do with it; the second is that even if they do, end users don’t have the capacity to aggregate and process that information and make decisions on the basis of it.
The Web Privacy Measurement community essentially exists to mitigate this asymmetry. The primary goal is to ferret out what is happening to your data online, and a secondary one is making this information useful by pushing for change, building tools for opt-out and control, comparison of different players, etc. The size of the community is an indication of how big the problem has gotten.
Before anyone starts trotting out the old line, “see, the market can solve everything!”, let me point out that the event schedule demonstrates, if anything, the opposite. The majority of what is produced here is intended wholly or partly for the consumption of regulators. Like many others, I found the “What privacy measurement is useful for policymakers?” panel to be the most interesting one. And let’s not forget that most of this is Government-funded research to begin with.
This community is very different from the others that I’ve belonged to. The mix of backgrounds is extraordinary: researchers mainly from computing and law, and a small number from other disciplines. Most of the researchers are academics, but a few work for industrial research labs, a couple are independent, and one or two work in Government. There were also people from companies that make privacy-focused products/services, lawyers, hobbyists, scholars in the humanities, and ad-industry representatives. Overall, the community has a moderately adversarial relationship with industry, naturally, and a positive relationship with the press, regulators and privacy advocates.
The make-up is somewhat similar to the (looser-knit) group of researchers and developers building decentralized architectures for personal data, a direction that my coauthors and I have taken a skeptical view of in this recent paper. In both cases, the raison d’être of the community is to correct the imbalance of power between corporations and the public. There is even some overlap between the two groups of people.
The big difference is that the decentralization community, typified by Diaspora, mostly tries to mount a direct challenge and overthrow the existing order, whereas our community is content to poke, measure, and expose, and hand over our findings to regulators and other interested parties. So our potential upside is lower — we’re not trying to put a stop to online tracking, for example — but the chance that we’ll succeed in our goals is much higher.
Thanks to Aleecia McDonald for reviewing a draft.
I’ve been on several program committees in the last year and a half. As I’ve written earlier, getting a behind-the-scenes look at how things work significantly improved my perception of research and academia. This post is a more elaborate set of observations based on my experience. It is targeted both at my colleagues, with the hope of starting a discussion, and at outsiders, as a continuation of my series on explaining how the scientific community functions (that began with the post linked above).
Benefits of doing peer review. Peer review is often considered a burden that one grudgingly accepts in order to keep the system working. But in my experience, especially for a junior researcher, the effort is well worth the time.
The most obvious advantage of being on a PC is that it forces you to read papers. Now if you’re the type that never needs external motivation to get things accomplished, this wouldn’t matter to you — you’d do literature study on a regular basis anyway. But many of us aren’t that disciplined; I’m certainly not.
There are also insights you get that you can’t reproduce by having perfect self-discipline. PC work gives you a raw, unfiltered look into the research that people have chosen to work on. This is a 6-month-or-so head start for getting on top of emerging trends compared to only reading published papers. You also get a better idea of common pitfalls to avoid.
Finally, peer review is one of the rare opportunities to read papers critically (it is harder with published work because it doesn’t have as many loopholes). This is not a natural skill for most people — our cognitive biases predispose us to confuse good rhetoric with sound logic.
Which type of meeting? I’ve been on PCs with all three types of discussions: physical meetings, phone meetings and online. I think it’s important to have a meeting, whether physical or phone. I learn a lot, and the outcome feels fairer. Besides, quite often one reviewer is able to point out something the others have missed. Chairs of online-only PCs do try to elicit some interaction between reviewers, but for hard-to-explain but easy-to-understand reasons, the bandwidth in an interactive meeting tends to be much higher.
Phone meetings are suitable for smaller conferences and workshops. In my experience, members mostly tend to go on mute and tune out except when the papers they reviewed are being discussed. I don’t necessarily see a problem with this.
In physical meetings, I’ve found that members often make comments or voice opinions on papers they haven’t really read. I don’t think this is in the best interest of fair reviewing (although I’ve heard a contrary opinion). I wonder if a strategy involving smaller breakout groups would be more effective.
The one advantage of not having a meeting is of course that it saves time. I’ve found that the time commitment for the meeting is about a third of the reviewing time (for both physical and phone meetings), which I don’t consider to be too much of a burden given the improved outcomes.
Overall, my experience from these meetings is that members act professionally for the most part without egos or emotions getting in the way. While there is inevitably some randomness in the process, I believe that the horror stories of careless reviewers — everyone has at least one to narrate — are exaggerated. One possible reason for this misunderstanding is that there is a lot that’s discussed at meetings after the reviews are written, and often this feedback doesn’t make it into the reviews.
Problem areas. Finally, here are some aspects of PCs that I think could be improved. I have deliberately omitted the most common problems (such as an untenable number of submissions and low acceptance rates) that everybody knows and talks about. Instead, these are less frequently discussed but (IMO) fairly important issues.
Lost reviews. Since reviewers aren’t perfect, sometimes bad papers with persistent authors manage to get published by being resubmitted to other venues until they hit a relatively sloppy panel of reviewers. The reason this works (when it does) is that past reviews of a recycled paper are “lost”. This is a shame; it wastes reviewer effort and lowers the overall quality of publications.
Community boundaries. As a reviewer I’ve started to realize how difficult it is to publish in other communities’ venues. As an example, at security conferences we often see papers by outsiders who have something useful to say but are unfortunately not familiar enough with the “central dogma” of crypto/security research, namely adversarial thinking. While I can see the temptation to reject these papers with a cursory note, I think we should be patient with these people, explain how we do things and if possible offer to work with them to improve the paper.
Unfruitful directions. Sometimes research directions don’t pan out, either because the world has moved on and the underlying assumptions are no longer true, or because the technical challenges are too hard. But researchers naturally resist having to change their research area, and so there are lots of papers written on topics that stopped being relevant years ago. The reason these papers keep getting published is that they are assigned for review to other people working in the same area. I’ve seen program chairs make an effort to push back on this, but the current situation is far from optimal.
In conclusion, my opinion is that peer review in my community is a relatively well-functioning process, albeit with a lot of scope for improvement. I believe this improvement can be accomplished in an evolutionary way without having to change anything too radically.
 The crypto/security community essentially derives its identity from adversarial thinking. Incidentally, I feel that it is not always suitable for privacy, which is why I believe computer scientists who study privacy should stop viewing ourselves as a subset of the security community.
I’ve had a wonderful time at Stanford these last couple of years, but it’s time to move on. I’m currently in the middle of my job search, looking for faculty and other research positions. In the next month or two I will be interviewing at several places. It’s been an interesting journey.
My Ph.D. years in Austin were productive and blissful. When I finished and came West, I knew I enjoyed research tremendously, but there were many aspects of research culture that made me worry if I’d fit in. I hoped my postdoc would give me some clarity.
Happily, that’s exactly what happened, especially after I started being an active participant in program committees and other community activities. It’s been an enlightening and humbling experience. I’ve come to realize that in many cases, there are perfectly good reasons why frequently-criticized aspects of the culture are just the way they are. Certainly there are still facets that are far from ideal, but my overall view of the culture of scientific research and the value of research to society is dramatically more positive than it was when I graduated.
Let me illustrate. One of my major complaints when I was in grad school was that almost nobody does interdisciplinary research (which is true — the percentage of research papers that span different disciplines is tiny). Then I actually tried doing it, and came to the obvious-in-retrospect realization that collaborating with people who don’t speak your language is hard.
Make no mistake, I’m as committed to cross-disciplinary research as I ever was (I just finished writing a grant proposal with Profs. Helen Nissenbaum and Deirdre Mulligan). I’ve gradually been getting better at it and I expect to do a lot of it in my career. But if a researcher makes a decision to stick to their sub-discipline, I can’t really fault them for that.
As another example, consider the lack of a “publish-then-filter” model for research papers, a whole two decades after the Web made it technologically straightforward. Many people find this incomprehensibly backward and inefficient. Academia.edu founder Richard Price wrote an article two days ago arguing that the future of peer review will look like a mix of PageRank and Twitter. Three years ago, that could have been me talking. Today my view is very different.
Science is not a popularity contest; PageRank is irrelevant as a peer-review mechanism. Basically, scientific peer review is the only process that exists for systematically separating truths from untruths. Like democracy, it has its problems, but at least it works. Social media is probably the worst analogy, since it seems to be better at amplifying falsehoods than facts. Wikipedia-style crowdsourcing has its strengths, but it can be hit-or-miss.
To be clear, I think peer review is probably going to change; I would like it to be done in public, for one. But even this simple change is fraught with difficulty: how would you ensure that reviewers aren’t influenced by each other’s reviews? This is an important factor in the current system. During my program committee meetings, I came to realize just how many of these little procedures for minimizing bias are built into the system and how seriously people take the spirit of this process. Revamping peer review while keeping what works is going to be slow and challenging.
Moving on, some of my other concerns have been disappearing due to recent events. Restrictive publisher copyrights are a perfect example. I have more of a problem with this than most researchers do — I did my Master’s in India, which means I’ve been on the other side of the paywall. But it looks like that pot may finally have boiled over. I think it’s only a matter of time now before open access becomes the norm in all disciplines.
There are certainly areas where the status quo is not great and not getting any better. Today if a researcher makes a discovery that’s not significant enough to write a paper about, they choose not to share that discovery at all. Unfortunately, this is the rational behavior for a self-interested researcher, because there is no way to get credit for anything other than published papers. Michael Nielsen’s excellent book exploring the future of networked science gives me some hope that change may be on the horizon.
I hope this post has given you a more nuanced appreciation of the nature of scientific research. Misconceptions about research and especially about academia seem to be widespread among the people I talk to both online and offline; I harbored a few myself during my Ph.D., as I said earlier. So I’m thinking of doing posts like this one on a semi-regular basis on this blog or on Google+. But that will probably have to wait until after my job search is done.
This article starts from the example of a simple privacy mishap and argues that the flawed thinking it exposes is a symptom of a deeper malaise and that the structure of privacy research in computer science might require rethinking.
I was surprised by a statement in a recent blog post by Geni, a genealogy-based social networking site, that plainly asserted, “following does not have any privacy implications.” This was in reference to the feature to “follow” a user or profile on the site, which among other things notifies you instantly of new information or activity about the person. (Admirably, however, Geni listened to their users and made some changes to the feature.)
Of course following has privacy implications. Without the follow feature — not just on Geni but on virtually every site that provides an equivalent capability — to obtain the same level of up-to-date information about a person, you’d have to either sit around constantly refreshing their profile or else write a bot that will do that for you and notify you of any updates by email. It is precisely because of this vast difference in the ease of keeping track of people that there was a backlash when Facebook introduced News Feed several years ago.
Why then would anyone claim that following has no privacy implications? The culprit here is “adversarial thinking,” an analytical process that computer scientists and security engineers are trained in. Under this paradigm, users are viewed as all-powerful “adversaries” (limited only by the fundamental computational limits of nature), typically interested in learning as much information about everyone as possible. Clearly, if everyone is an “adversary,” the follow feature makes not a whit of difference, since anyone could create and operate the bot mentioned above with no effort at all.
Weird as it may seem to the uninitiated, adversarial thinking is second nature to computer scientists. It is adversarial thinking that leads to the formulation of privacy as an access-control problem, something that I’ve criticized; the Geni blog post explicitly mentions this as their formulation of privacy. Privacy-as-access-control makes for neat papers but tends to break down quickly in the real world.
Let me be clear: adversarial thinking is a deep and valuable skill that is indispensable in the context that it is meant for — designing cryptosystems. However, it is not always the right paradigm in the privacy context. The theoretical study of database privacy seems to be doing rather well by borrowing methods from cryptography, and I’ve argued in support of adversarial thinking therein. On the other hand, social networking privacy falls squarely in the class of studies in which I find the adversarial approach to have limited value.
There’s a bigger take-away here: the structure of privacy research within computer science might require rethinking. Privacy is currently not considered a first-rate topic but is instead a side-interest of different communities such as security, cryptography and databases/datamining. As a result of this lack of primacy, not only do we frequently use the wrong methods — when all you’ve got is a hammer, everything looks like a nail — we’re also missing out on the chance to borrow from the literature on privacy in fields like law, economics, sociology, and human-computer interaction.
 This is not the only reason why the follow feature has privacy implications. On Livejournal, being followed by people with offensive usernames is sometimes a problem, compounded by the fact that due to the UI, it is not obvious who is following whom. In fact, the privacy changes made by Geni seem intended to address roughly this type of concern rather than the ease-of-tracking issue.
 While the term adversary is standard, adversarial thinking is a term I’ve coined here to describe a somewhat loose collection of axioms (including, for example, Kerckhoffs’s principle) that constitute the dominant paradigm of cryptography/security. I don’t think there is an extant term; I’d love to be corrected.
Thanks to Aleksandra Korolova for comments on a draft.