Robots.txt – Definition
The robots.txt file is a text file located at the root of your website (usually created by a webmaster who has access to the server) that tells search engine robots how to crawl your site. Its instructions tell a robot whether it is ‘allowed’ (authorised) or ‘disallowed’ (unauthorised) to crawl the parts of the website indicated. Because it gives direct orders to crawler robots, this file is vital for your SEO: using it the wrong way can have a huge impact on your ranking.
The file consists of lines of instructions for search engine crawlers. Robots ‘read’ it before they read the rest of your site, then crawl your site on the basis of the instructions it contains.
The structure of a robots.txt file
A robots.txt file has a very simple structure. Here is the most basic format:
User-agent: [robot crawler name]
Disallow: [URL path not to be crawled]
These two lines can be considered a complete robots.txt file.
You can give different instructions in a robots.txt file depending on the robot crawler.
For example, Wikipedia’s robots.txt shows 4 different instructions for 4 different user agents (MJ12Bot, Mediapartners-Google*, IsraBot, Ortograffe). In a robots.txt file that contains multiple user agents, each Disallow or Allow rule applies only to the user agents named in the same group, that is, the block of lines that is not interrupted by a blank line.
Each search engine has its own crawler robots! For example:
- Google – Googlebot
- Bing – bingbot
- Yahoo! – Slurp
- DuckDuckGo – DuckDuckBot
- Baidu – Baiduspider
If you are looking for a particular user-agent (the name that identifies one crawler or a set of crawlers), go to http://www.user-agents.org/
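The way rules attach to user agents can be checked with a short script. The sketch below uses Python's standard urllib.robotparser module. The two groups, the /private/ path and the URLs are made-up values for illustration, not rules taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with two groups separated by a blank line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: bingbot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Each rule only applies to the user agent named in its own group.
googlebot_private = parser.can_fetch("Googlebot", "https://www.mysite.com/private/page.html")
googlebot_public = parser.can_fetch("Googlebot", "https://www.mysite.com/blog/")
bingbot_public = parser.can_fetch("bingbot", "https://www.mysite.com/blog/")

# A crawler with no matching group (and no '*' group) is not restricted.
duckduckbot_private = parser.can_fetch("DuckDuckBot", "https://www.mysite.com/private/page.html")

print(googlebot_private, googlebot_public, bingbot_public, duckduckbot_private)
```

Here Googlebot is blocked only from /private/, bingbot is blocked everywhere, and DuckDuckBot, which has no group of its own, is free to crawl everything.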
Examples of robots.txt – How to understand a robots.txt file
In this section, we will look at some examples of robots.txt to help you understand a little more in depth how a robots.txt file works.
Your site: www.mysite.com
URL of the robots.txt of your site: www.mysite.com/robots.txt
Example 1 – Block all crawlers from your site
User-agent: *
Disallow: /
Using this robots.txt tells ALL crawler robots not to crawl your site (including your homepage). The ‘*’ refers to ALL crawler robots.
Example 2 – Authorising all robots to crawl your site
User-agent: *
Allow: /
Using this robots.txt tells ALL robot crawlers to explore all of your site (your home page included).
Example 3 – Prevent a particular crawler from crawling a particular subfolder
User-agent: Googlebot
Disallow: /example-subfolder/
This robots.txt instructs Google’s crawler (User-agent: Googlebot) not to crawl the specified subfolder (Disallow: /example-subfolder/).
Example 4 – Prevent a particular crawler from crawling a particular page
User-agent: Slurp
Disallow: /example-subfolder/page-to-block.html
This robots.txt file instructs Yahoo!’s crawler (User-agent: Slurp) not to crawl one page in particular (Disallow: /example-subfolder/page-to-block.html).
These 4 examples should give you an understanding of the logic behind robots.txt files! If it’s still not clear, you can read this Google guide to robots.txt
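If you would like to try the four examples yourself, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the test URLs are assumptions made for this illustration. Python's parser is also only an approximation of how each search engine reads the file (it applies the first matching rule in a group), so treat the output as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


SITE = "https://www.mysite.com"  # example domain used in this article

EXAMPLE_1 = "User-agent: *\nDisallow: /"
EXAMPLE_2 = "User-agent: *\nAllow: /"
EXAMPLE_3 = "User-agent: Googlebot\nDisallow: /example-subfolder/"
EXAMPLE_4 = "User-agent: Slurp\nDisallow: /example-subfolder/page-to-block.html"

results = {
    "Example 1, Googlebot, homepage": is_allowed(EXAMPLE_1, "Googlebot", SITE + "/"),
    "Example 2, Googlebot, homepage": is_allowed(EXAMPLE_2, "Googlebot", SITE + "/"),
    "Example 3, Googlebot, subfolder page": is_allowed(
        EXAMPLE_3, "Googlebot", SITE + "/example-subfolder/page.html"
    ),
    "Example 3, bingbot, subfolder page": is_allowed(
        EXAMPLE_3, "bingbot", SITE + "/example-subfolder/page.html"
    ),
    "Example 4, Slurp, blocked page": is_allowed(
        EXAMPLE_4, "Slurp", SITE + "/example-subfolder/page-to-block.html"
    ),
    "Example 4, Slurp, another page": is_allowed(
        EXAMPLE_4, "Slurp", SITE + "/example-subfolder/other-page.html"
    ),
}

for label, allowed in results.items():
    print(label, "->", "allowed" if allowed else "blocked")
```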
How to check whether your site has a robots.txt file
Not sure whether your site has a robots.txt file? It’s easy to check. Type your root domain, and then add /robots.txt at the end of the URL. For example, the robots.txt file from our site is at edit-place.co.uk/robots.txt.
If nothing appears (or you get a ‘404 Not Found’ error page), it means that you currently have no robots.txt file online.
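The same check can be scripted. The sketch below, in Python with only the standard library, builds the robots.txt address from any page URL and then asks the server whether the file exists. The function names, the "robots-check-sketch" User-Agent string and the choice to treat every HTTP error as "no file" are assumptions made for this illustration.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that `page_url` belongs to."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        # Assumption: callers pass a full URL such as https://www.mysite.com/page
        raise ValueError("expected a full URL including http:// or https://")
    # robots.txt always lives at the root of the host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def has_robots_txt(page_url: str, timeout: float = 10.0) -> bool:
    """Return True if the site answers the robots.txt request with HTTP 200."""
    request = urllib.request.Request(
        robots_txt_url(page_url),
        headers={"User-Agent": "robots-check-sketch"},  # hypothetical name
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # 404 and other HTTP errors: treated here as "no robots.txt online".
        return False
```

For example, robots_txt_url("https://www.mysite.com/blog/post.html") gives https://www.mysite.com/robots.txt, and has_robots_txt can then be called with the address of any page on your own site.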
How to create a robots.txt file
If you don’t have a robots.txt file on your site or you want to modify the existing one, we would advise you to read this Google article which sets out the process of creating a robots.txt file. Once created, you can test your robots.txt file with the Search Console.
Advice / Tips on robots.txt files
- To be read by crawlers, your robots.txt file must be at the root of your website.
- The robots.txt filename is case-sensitive, so be exact – it’s “robots.txt” not “Robots.txt” or “robot.txt” or whatever 🙂
- Some crawlers may completely ignore the instructions given by your robots.txt file. This is common with malicious crawlers (email address harvesting, site hacking, etc.).
- The robots.txt file is public! You can access any such file by adding “/robots.txt” at the end of a domain name in your browser! Try it with your favourite sites 🙂
- /!\ Be careful though, do not store ANY personal information in your robots.txt file! It’s public!
- Google does not require you to have a robots.txt file, but it is strongly recommended!
It is also generally advisable to include the address of your sitemap in your robots.txt file. Since search engine bots start crawling your site by checking its robots.txt file, this gives you the opportunity to bring your sitemap.xml to their attention.
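Just add a line like this to your robots.txt file (shown here for the example site www.mysite.com; replace it with the full URL of your own sitemap):
Sitemap: https://www.mysite.com/sitemap.xml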
And if the term sitemap doesn’t mean anything to you, find out in this article how to create and submit a sitemap to Google.
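To confirm that a sitemap declaration in robots.txt can be read by a program, here is a minimal sketch using Python's standard urllib.robotparser module (its site_maps() method needs Python 3.8 or later). The robots.txt content and the sitemap URL are example values based on the www.mysite.com domain used in this article, not a real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that allows everything and declares one sitemap.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://www.mysite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None when no Sitemap line is present.
sitemaps = parser.site_maps() or []
print(sitemaps)
```

An empty list here would mean that no Sitemap line was found, which is worth checking before you rely on crawlers discovering your sitemap this way.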