Monday, 1 July 2019

Formalizing the Robots Exclusion Protocol Specification


For 25 years, the Robots Exclusion Protocol (REP) has been one of the most basic and critical components of the web. It allows website owners to exclude automated clients, such as web crawlers, from accessing their sites, either partially or completely.

In 1994, Martijn Koster (a webmaster himself) created the initial standard after crawlers were overwhelming his site. With input from other webmasters, the REP was born, and it was adopted by search engines to help website owners manage their server resources more easily.

However, the REP was never turned into an official Internet standard, which means that developers have interpreted the protocol somewhat differently over the years. And since its inception, the REP hasn't been updated to cover today's corner cases. This is a challenging problem for website owners, because the ambiguous de facto standard makes it difficult to write the rules correctly.

We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers. Together with the original author of the protocol, webmasters, and other search engines, we've documented how the REP is used on the modern web, and submitted it to the IETF.

The proposed REP draft reflects over 20 years of real-world experience with robots.txt rules, used both by Googlebot and other major crawlers, as well as by about half a billion websites that rely on the REP. These fine-grained controls give the publisher the power to decide what they'd like to be crawled on their site and potentially shown to interested users. The draft doesn't change the rules created in 1994; rather, it defines essentially all previously undefined scenarios for robots.txt parsing and matching, and extends the protocol for the modern web. Notably:

  1. Any URI-based transfer protocol can use robots.txt. For example, it's no longer limited to HTTP and can be used for FTP or CoAP as well.

  2. Developers must parse at least the first 500 kibibytes of a robots.txt file. Defining a maximum file size ensures that connections are not open for too long, alleviating unnecessary strain on servers.

  3. A new maximum caching time of 24 hours (or the value of a cache directive, if one is available) gives website owners the flexibility to update their robots.txt whenever they want, while ensuring that crawlers don't overload websites with robots.txt requests. For example, in the case of HTTP, Cache-Control headers could be used to determine the caching time (see the sketch after this list).

  4. The specification now provisions that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
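
To make items 2 through 4 concrete, here is a minimal sketch, in Python, of how a crawler might fetch and cache robots.txt under these constraints. The 500 kibibyte cap, the 24-hour default cache lifetime, and the fallback on server failure come straight from the list above; the use of the requests library, the function name, and the in-memory cache are illustrative assumptions, not part of the draft.

    import time
    import requests  # illustrative HTTP client choice

    MAX_BYTES = 500 * 1024       # parse at most the first 500 kibibytes
    DEFAULT_TTL = 24 * 60 * 60   # cache for at most 24 hours by default

    _cache = {}  # host -> (fetched_at, ttl_seconds, body or None)

    def fetch_robots_txt(host):
        """Return the cached or freshly fetched robots.txt body for a host.

        Returns None if the file has never been reachable; if it was reachable
        before and the server now fails, the stale copy is reused so that known
        disallowed pages stay uncrawled.
        """
        now = time.time()
        cached = _cache.get(host)
        if cached and now - cached[0] < cached[1]:
            return cached[2]
        try:
            resp = requests.get("https://%s/robots.txt" % host, timeout=10, stream=True)
            data = b""
            for chunk in resp.iter_content(chunk_size=16384):
                data += chunk
                if len(data) >= MAX_BYTES:
                    break
            body = data[:MAX_BYTES].decode("utf-8", "replace")
            # Honor an HTTP Cache-Control max-age if present, else the 24-hour default.
            ttl = DEFAULT_TTL
            for part in resp.headers.get("Cache-Control", "").split(","):
                part = part.strip()
                if part.startswith("max-age=") and part[8:].isdigit():
                    ttl = int(part[8:])
            _cache[host] = (now, ttl, body)
            return body
        except requests.RequestException:
            # Server failure: keep serving the stale copy rather than suddenly
            # treating previously disallowed pages as crawlable.
            return cached[2] if cached else None

A crawler would re-check this cache before each batch of requests to a host and hand the returned body to a parser such as the one sketched further below.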


Additionally, we've updated the augmented Backus–Naur form in the internet draft to better define the syntax of robots.txt, which is critical for developers to parse the lines.
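
For a sense of what that grammar boils down to in practice, below is a small sketch of line-level parsing in Python. It follows the general shape the grammar describes (groups of user-agent lines followed by allow and disallow rules), but it is a simplification for illustration, and the function and variable names are our own rather than taken from the draft.

    def parse_robots_txt(body):
        """Split robots.txt content into (user_agents, rules) groups.

        Each group pairs a list of user-agent tokens with a list of
        (directive, path_pattern) tuples, where directive is "allow" or
        "disallow". Comments and unrecognized lines are skipped.
        """
        groups = []
        agents, rules = [], []
        for raw_line in body.splitlines():
            line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
            if ":" not in line:
                continue
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":
                if rules:  # a user-agent line after rules starts a new group
                    groups.append((agents, rules))
                    agents, rules = [], []
                agents.append(value)
            elif field in ("allow", "disallow"):
                rules.append((field, value))
        if agents or rules:
            groups.append((agents, rules))
        return groups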

RFC stands for Request for Comments, and we mean it: we uploaded the draft to IETF to get feedback from developers who care about the basic building blocks of the internet. As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right.

If you'd like to drop us a comment, ask us questions, or just say hi, you can find us on Twitter and in our Webmaster Community, both offline and online.
