Search engines are the lifeblood of the modern Internet – a vast majority of sites depend on them for visibility, and a majority of Internet users depend on them to find whatever they wish.
Entire industries have been built around search engines, with SEO being one of the most recognizable. There is, however, a small corner of the search ecosystem that concerns itself not with what the engines' crawlers find, but with what they should leave out. It is known as robots.txt.
What Is robots.txt, and When Is It Used?
robots.txt began as a standard in 1994, conceived by Martijn Koster. Also known as the "robots exclusion standard" or "robots exclusion protocol", it is used by websites to define which areas of their domain should not be crawled.
The protocol is read by web crawlers and other web robots, allowing them to categorize websites more accurately. It became a de facto standard, and most search engines of the time complied with it, including pioneers WebCrawler, AltaVista, and Lycos.
One account of the protocol's history says it was conceived after a badly written crawler inadvertently caused a denial-of-service attack on Koster's server. Today, however, robots.txt is predominantly used for the following reasons:
- When the site owner wants certain pages kept out of search engine results
- When the site contains content that may mislead the crawler or cause the site to be miscategorized
- When the site wants the robot to index only specific data
The standard takes the form of a text file (named "robots.txt") placed at the root of the website's hierarchy. The file contains instructions, written in a specific format, telling the crawler which parts of the site it should not process. If the file does not exist, a search engine will assume that the site owner allows it to crawl the entire site.
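A minimal robots.txt file might look like this (the paths are hypothetical):

```
User-agent: *
Disallow: /admin/
Disallow: /tmp/
```

The `User-agent` line names the robot the record applies to ("*" meaning all robots), and each `Disallow` line lists a path prefix that robot is asked to skip.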
Despite being called a "protocol", robots.txt is purely advisory. Search engines and web robots may or may not follow it, at their discretion. In fact, some malicious robots such as spambots, email harvesters, and other malware may actually use robots.txt as a directory that tells them which parts of the site to crawl first. As such, the National Institute of Standards and Technology (NIST) recommends against relying on robots.txt as a means of privacy or security.
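To illustrate the advisory nature of the file: a well-behaved crawler must actively check the rules before fetching a URL. Python's standard library ships `urllib.robotparser` for exactly this check; the sketch below parses a hypothetical set of rules and asks whether given paths may be fetched.

```python
# A sketch of the check a well-behaved crawler performs before fetching
# a URL, using Python's standard-library urllib.robotparser.
# The rules and paths below are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /private/readme.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Compliance is voluntary: the parser only reports what the file asks.
print(parser.can_fetch("*", "/private/secret.html"))   # → False
print(parser.can_fetch("*", "/private/readme.html"))   # → True
print(parser.can_fetch("*", "/public/index.html"))     # → True
```

A robot that ignores this check loses nothing technically; nothing on the server enforces the file, which is why NIST's caution above matters.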
Also, using robots.txt requires the webmaster to place a copy at every origin. For example, websites with multiple subdomains will need a robots.txt file placed within each of them — placing it under "samplesite.com" will not cover "x.samplesite.com" and "y.samplesite.com". In the same vein, each protocol and port needs its own file — creating "http://www.samplesite.com/robots.txt" will not apply to "https://www.samplesite.com" or "http://www.samplesite.com:8080/".
Several crawlers also support an extension called "Crawl-delay", a parameter that allows the webmaster to set the number of seconds between the robot's successive requests to the server. This is useful for low-capacity servers, helping them avoid overloads like the accidental denial of service described earlier. The directive is placed inside the record for the user agent it applies to.
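For example, a record asking all robots to wait ten seconds between requests (a hypothetical figure) would read:

```
User-agent: *
Crawl-delay: 10
```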
Major crawlers also support an "Allow" directive, to counteract the "Disallow" directive that appears in the standard file. This is useful when the webmaster wants to exclude an entire directory except for a few specific files within it (e.g., Allow: /sample/directory/file1 together with Disallow: /sample/directory/).
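Formatted as a file, such a record (with hypothetical paths) would look like this:

```
User-agent: *
Allow: /sample/directory/file1.html
Disallow: /sample/directory/
```

Note that crawlers differ in how they resolve conflicts: Google applies the most specific (longest) matching rule, while some other implementations take the first rule that matches, so listing the Allow line first covers both behaviors.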
Finally, a few crawlers (notably Yandex) support a "Host" directive, allowing sites with more than one mirror to specify their preferred domain. Where supported, it is inserted below the Crawl-delay line.
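Putting these extensions together, a hypothetical file for a mirrored, low-capacity site might read:

```
User-agent: *
Disallow: /private/
Crawl-delay: 5
Host: www.samplesite.com
```

Crawlers that do not recognize Crawl-delay or Host simply ignore those lines, so including them is harmless for other robots.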
When used properly, these extensions and the exclusion standard as a whole can help optimize your site, improving its visibility and usability for search engines and users alike.