Web Crawlers and Privacy: The Need to Reboot Robots.txt

December 5, 2010 at 7:54 pm

This is a position paper I co-authored with Pete Warden and will be discussing at the upcoming IAB/IETF/W3C Internet privacy workshop this week.


Privacy norms, rules and expectations in the real world go far beyond the “public/private” dichotomy. Yet in the realm of web crawler access control, we are tied to this binary model via the robots.txt allow/deny rules. This position paper describes some of the resulting problems and argues that it is time for a more sophisticated standard.

The problem: privacy of public data. The first author has argued that individuals often expect privacy constraints on data that is publicly accessible on the web. Some examples of such constraints relevant to the web-crawler context are:

  • Data should not be archived beyond a certain period (or at all).
  • Crawling a small number of pages is allowed, but large-scale aggregation is not.
  • Linkage of personal information to other databases is prohibited.

Currently there is no way to specify such restrictions in a machine-readable form. As a result, sites resort to hacks such as identifying and blocking crawlers whose behavior they don’t like, without clearly defining acceptable behavior. Other sites specify restrictions in their Terms of Service and bring legal action against violators. This is clearly not a viable solution — for operators of web-scale crawlers, manually interpreting and encoding the ToS restrictions of every site is prohibitively expensive.
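To make the gap concrete, here is a minimal sketch of the only question the current protocol lets a cooperative crawler ask, using Python's standard robots.txt parser (the crawler name and the example.com URLs are placeholders):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder site
    rp.read()

    # The protocol answers allow/deny and nothing else; there is no way to ask
    # "may I archive this page for 30 days?" or "may I link it to other data?"
    print(rp.can_fetch("ExampleCrawler/1.0", "https://example.com/users/alice"))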

There are two reasons why the problem has become pressing: first, there is an ever-increasing quantity of behavioral data about users that is valuable to marketers — in fact, there is even a black market for this data — and second, crawlers have become very cheap to set up and operate.

The desire for control over web content is by no means limited to user privacy concerns. Publishers concerned about copyright are equally in search of a better mechanism for specifying fine-grained restrictions on the collection, storage and dissemination of web content. Many site owners would also like to limit the acceptable uses of data for competitive reasons.

The solution space. Broadly, there are three levels at which access/usage rules may be specified: site-level, page-level and DOM element-level. Robots.txt is an example of a site-level mechanism, and one possible solution is to extend robots.txt. A disadvantage of this approach, however, is that the file may grow too large, especially on sites with user-generated content that may wish to specify per-user policies.
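As a rough sketch of what a site-level extension might look like, consider the hypothetical directives below; the names Max-Retention, Max-Pages and No-Linkage are invented purely for illustration and are not part of any existing standard:

    # A hypothetical extended robots.txt; the non-standard directives are invented.
    EXTENDED_ROBOTS_TXT = """
    User-agent: *
    Disallow: /private/
    Max-Retention: 30d   # do not archive pages for more than 30 days
    Max-Pages: 1000      # bulk aggregation beyond this is not permitted
    No-Linkage: true     # do not join this data with other databases
    """

    def parse_extended_directives(text):
        """Collect the hypothetical policy directives from a robots.txt file."""
        policy = {}
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments
            if ":" in line:
                key, value = line.split(":", 1)
                if key.strip() in ("Max-Retention", "Max-Pages", "No-Linkage"):
                    policy[key.strip()] = value.strip()
        return policy

    print(parse_extended_directives(EXTENDED_ROBOTS_TXT))
    # {'Max-Retention': '30d', 'Max-Pages': '1000', 'No-Linkage': 'true'}

Per-user rules would multiply such entries quickly, which is exactly the scaling problem noted above.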

A page-level mechanism thus sounds much more suitable. While there is already a “robots” META tag, it is part of the same robots exclusion standard and has the same limitations on functionality. A different META tag is probably an ideal place for a new standard.
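As a sketch of what a page-level mechanism could look like, the snippet below pulls both the existing “robots” META tag and a hypothetical “data-policy” tag out of a page; the “data-policy” name and its values are invented for illustration:

    from html.parser import HTMLParser

    PAGE = """<html><head>
    <meta name="robots" content="noindex, noarchive">
    <meta name="data-policy" content="max-retention=30d, no-linkage">
    </head><body>...</body></html>"""

    class PolicyExtractor(HTMLParser):
        """Collect META tags that carry crawler policy information."""
        def __init__(self):
            super().__init__()
            self.policies = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                if attrs.get("name") in ("robots", "data-policy"):
                    self.policies[attrs["name"]] = attrs.get("content", "")

    extractor = PolicyExtractor()
    extractor.feed(PAGE)
    print(extractor.policies)
    # {'robots': 'noindex, noarchive', 'data-policy': 'max-retention=30d, no-linkage'}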

Taking it one step further, tagging at the DOM element level using microformats to delineate personal information has also been proposed. A possible disadvantage of this approach is the parsing overhead that crawlers would have to incur in order to be compliant.
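A rough sketch of the element-level idea follows; the class name “x-no-aggregate” is invented for illustration and is not an existing microformat. Note that honoring such markup forces the crawler to run a full HTML parse on every page it fetches, which is the overhead mentioned above:

    from html.parser import HTMLParser

    PAGE = """<div class="profile">
      <span class="x-no-aggregate">Alice Example, alice@example.com</span>
      <p>Public bio text that may be crawled and stored normally.</p>
    </div>"""

    class ElementPolicyScanner(HTMLParser):
        """Count page elements that opt out of aggregation."""
        def __init__(self):
            super().__init__()
            self.restricted = 0

        def handle_starttag(self, tag, attrs):
            if "x-no-aggregate" in dict(attrs).get("class", "").split():
                self.restricted += 1

    scanner = ElementPolicyScanner()
    scanner.feed(PAGE)
    print(scanner.restricted, "element(s) marked do-not-aggregate")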

Conclusion. While the need to move beyond the current robots.txt model is apparent, it is not yet clear what should replace it. The challenge in developing a new standard lies in accommodating the diverse requirements of website operators and precisely defining the semantics of each type of constraint without making it too cumbersome to write a compliant crawler. In parallel with this effort, the development of legal doctrine under which the standard is more easily enforceable is likely to prove invaluable.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.


5 Comments

  • 1. Pravin  |  December 5, 2010 at 11:34 pm

    robots.txt might be over-simplistic, but one advantage of that is that it’s usable. When “extending” robots.txt, you need to make sure you don’t end up designing something like P3P!

    Also, since some of your audience might be non-CS, it’s very important that they understand that this only protects you from Google/Yahoo/MS etc., and not the secret evil sites. (It’s sort of like the evil bit RFC :))

  • 2. Jeremy Chatfield  |  December 6, 2010 at 1:04 am

    X-Robots-Tag headers allow cache control, including “not after”. See http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html
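    For instance (a sketch; example.com is just a placeholder), a crawler can read that header with a few lines of Python:

        import urllib.request

        resp = urllib.request.urlopen("https://example.com/some-page")
        print(resp.headers.get("X-Robots-Tag"))
        # e.g. "noarchive, unavailable_after: 25 Jun 2011 15:00:00 PST"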

    There is no control over crawling a part of a web site, but the Server can effect such a control, on a per robot basis, by offering 4xx or 5xx restrictions to crawling on excessive requests. That’s under the control of the Server and doesn’t need a robots.txt extension surely?

    I don’t understand the “linkage” reason. What does that mean? Can you give an example, please?

    Isn’t behavioural data largely in cookies (browser cookies and flash cookies), rather than on pages, unless you mean a social graph/FOAF type content? Isn’t this more a matter of web analytics than robots? I may have missed a way to use bots to infer behaviour, but this doesn’t appear to be a problem.

    And, like Pravin above, surely the more offensive behaviours will be from non-compliant bots. Putting private material behind an authentication API is probably the only way to defeat non-compliant bots – and you haven’t proposed a way to allow legitimate crawlers to authenticate.

    • 3. Arvind  |  December 6, 2010 at 8:59 am

      > X-Robots-Tag headers allow cache control, including “not after”.

      Interesting, I’ll look into it, thanks.

      > There is no control over crawling a part of a web site, but the Server can effect such a control, on a per robot basis, by offering 4xx or 5xx restrictions to crawling on excessive requests. That’s under the control of the Server and doesn’t need a robots.txt extension surely?

      This is very suboptimal. You’re assuming a malicious crawler. I’m mostly interested in the cooperative crawler case, where it is important for the crawler to have visibility into the policies of the server rather than being abruptly cut off.

      > Isn’t behavioural data largely in cookies (browser cookies and flash cookies), rather than on pages, unless you mean a social graph/FOAF type content? Isn’t this more a matter of web analytics than robots? I may have missed a way to use bots to infer behaviour, but this doesn’t appear to be a problem.

      I meant things like social networking profile data. Granted, only some of it is “behavioral.” As I mentioned in the post, scraping this data is valuable for marketing purposes.

      > And, like Pravin above, surely the more offensive behaviours will be from non-compliant bots. Putting private material behind an authentication API is probably the only way to defeat non-compliant bots – and you haven’t proposed a way to allow legitimate crawlers to authenticate.

      Again I’m not too interested in non-compliant bots; I think those are a lot easier to deal with — like you said earlier, cut them off if they misbehave.

      The hard problem that I’m trying to address is that a bot would like to comply but doesn’t know what the rules are. This scenario is much more common because the firms doing the data scraping have enough PR and legal problems to worry about already. Incidentally, the overemphasis on malicious adversaries is a topic I frequently write about.

      • 4. Jeremy Chatfield  |  December 6, 2010 at 10:51 am

        Hmm, so what’s wrong with using robots.txt to deny access to most of the site, specifically allowing access to some parts of the site (controlling quantity exposed), and using the sitemap protocol (sitemap.org) to list URLs that you’d like the bots to reach; applying Meta Robots “NOFOLLOW” to all links in the controlled area to reduce discovery of new URLs for bots to probe, and using meta robots “NOINDEX” on connected pages to make sure that connected information isn’t exposed?

        Finally, simply hiding the material in non-SE-friendly AJAX is pretty easy. Bane of my life, actually, making sure that sites expose information in ways that search engines can use. Just stick the FOAF/relationship material in a section of the DOM that is updated by JavaScript; dumb bot scrapers won’t see the information as they don’t (yet) run JavaScript.

        I think that the existing set of tools is sufficient to allow control. I might even set up a test site to prove it – but it’ll be a month or two, as the run-up to Christmas and the New Year is about our busiest time of year – I shouldn’t really spare the time to answer this stuff… But I do enjoy 33bits ;)

        Yes, I have some focus on malicious adversaries; I do a lot of web server log file analysis, and SEO. Bot behaviour and control is crucial for what I do, and I do notice undeclared non-observant bots in logs, poking around parts of the site that aren’t permitted. Most appear to be hunting for, and submitting, forms. I also have a particular loathing for spam (mail and web) – and these bots are the ones submitting spam comments, spam emails, spam contact forms and scraping pages for spam-laden Made For AdSense sites. More annoying on a daily basis than behavioural targeted advertising. :)

        Malicious bots piss me off, a lot.

    • 5. Arvind  |  December 6, 2010 at 12:02 pm

      Forgot to answer this:

      > I don’t understand the “linkage” reason. What does that mean? Can you give an example, please?

      Spokeo is the perfect example. They aggregate data from web crawls and link together people’s information that is scattered across different sites. By presenting a person’s previously fragmented information in a unified interface, the site gives the appearance of a detailed dossier and freaks people out (quite reasonably, IMO).

      A site might want to prevent data about its users from being crawled and used for this purpose.



