Robots.txt | Lesson 8/34 | SEMrush Academy

You’ll gain an understanding of search crawlers and how to optimally budget for them.
Watch the full course for free: https://bit.ly/3gNNZdu

0:18 Robots.txt file
0:39 Limitations
3:07 Misusing Robots.txt
3:31 Debugging Mode
5:11 Checks available in SEMrush Site Audit

✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹
You might find it useful:
Tune up your website’s internal linking with the Site Audit tool:
https://bit.ly/2XVxCmL
Understand how Google bots interact with your website by using the Log File Analyzer:
https://bit.ly/3cs0rfC

Learn how to use SEMrush Site Audit in our free course:
https://bit.ly/2Xsb3XT
✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹ ✹

The robots exclusion protocol (REP), with robots.txt at its core, is probably the oldest way to control search engine crawlers. The robots.txt file lives in the site’s root directory and usually specifies the parts of a site that crawlers are not supposed to access. The file name must be entirely lowercase, and it resides at yourdomain.com/robots.txt

Robots.txt comes with limitations. The instructions in the file do not have to be adhered to: while Google, Bing, and other major search engines stick to them, that does not necessarily mean all other web crawlers (email harvesters, etc.) do. The interpretation of the syntax also depends on the crawler: Google can handle wildcards, whereas other crawlers can’t (or can’t all of the time). Nor can the policies in a robots.txt file prevent other websites from linking to those URLs or to your domain in general. Excluding certain URLs via the REP therefore does not guarantee confidentiality. To keep a document truly secret, you should use other methods (HTTP authentication, a VPN, etc.).

The most common use case for robots.txt is to prevent specific crawlers, identified by user-agent, from accessing certain files and/or directories. To do this, you first name the user-agent, for example Googlebot. This is followed by a Disallow statement with the URL path to be blocked. You can also define different rules for, say, Googlebot versus Googlebot-News. A great feature unique to Google’s crawlers is that they also understand wildcards in robots.txt.
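As an illustrative sketch (the paths here are hypothetical), such per-crawler rules could look like this:

```text
# Rules for Google's general web crawler
User-agent: Googlebot
Disallow: /internal-search/

# A different, stricter rule set for Google's news crawler
User-agent: Googlebot-News
Disallow: /
```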

So you can also prevent access to specific file types (say, all .json files) or other specific extensions with a single line in robots.txt. For crawlers that don’t support wildcards this is not possible, unless all the affected files happen to reside in the same folder.
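A Googlebot-specific wildcard rule of that kind could look like this (the .json extension is just an illustrative choice):

```text
User-agent: Googlebot
# * matches any sequence of characters, $ anchors the end of the URL
Disallow: /*.json$
```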

You can also combine directives: specify the user-agent you want to restrict, disallow a specific folder, and on the next line add a positive Allow statement, for example granting access to a subfolder that resides within the previously disallowed folder. Combining rules like this is absolutely possible.
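Such a combination might look like this (the folder names are hypothetical):

```text
User-agent: *
# Block the main folder...
Disallow: /downloads/
# ...but re-open one subfolder inside it
Allow: /downloads/whitepapers/
```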

Be aware that misusing robots.txt can cause significant harm. If you accidentally block the entire domain from crawling, the consequences can be severe. So I really want to stress that you should use the robots.txt testing tool available in Google Search Console.

The debugging mode in the console gives you an idea of how Google understands each of your statements. It shows whether the syntax is correct and how each and every line is interpreted. You can even test sample URLs to see if your rules have the desired effect.
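Alongside the Search Console tester, you can sanity-check rules programmatically. Here is a minimal sketch using Python’s standard-library parser, with hypothetical rules. Note that, unlike Google’s parser, urllib.robotparser applies rules in file order and does not support wildcards, so keep an Allow line before the broader Disallow it carves an exception out of:

```python
from urllib import robotparser

# Hypothetical robots.txt rules to test against
rules = """
User-agent: *
Allow: /private/annual-report.html
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# The broad Disallow blocks the directory...
print(parser.can_fetch("Googlebot", "https://example.com/private/secret.html"))
# ...while the earlier Allow carves out a single page.
print(parser.can_fetch("Googlebot", "https://example.com/private/annual-report.html"))
```

This only approximates how Google interprets the file; Google’s own tester remains the authoritative check.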

Monitor your robots.txt continuously. You’d be amazed how often a robots.txt file gets changed by accident, and all of a sudden things stop working as they should.

An advanced, Google-specific use case is to combine a Disallow statement with a Noindex statement. This is an undocumented feature and not part of the REP, but it has worked very well: you disallow a folder or URL for crawling and add a Noindex directive for it as well. As a result, unlike merely disallowed URLs, those tagged with Noindex do not appear in the SERPs at all, so this method prevents both crawling and indexation at the same time. Again, since it is not officially part of the REP, Google might change this behaviour or stop supporting it altogether; if you’re in it for the long run, it is probably better not to rely on it. It can, however, be a great help during migrations, because your staging server would be neither crawled nor indexed – and crawling tools can easily override robots.txt, which lets you test indexation directives as well.
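As a sketch, this undocumented pattern looked like the following (the path is hypothetical; note that Google announced in 2019 that it would stop honoring Noindex in robots.txt, so treat this as historical):

```text
User-agent: Googlebot
Disallow: /staging/
# Undocumented, not part of the REP; Google announced in 2019
# that it no longer supports Noindex in robots.txt
Noindex: /staging/
```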

If you want to make sure that your robots.txt file is free of dangerous issues, there are several checks available in the SEMrush Site Audit. With these checks, you can find out:

Whether your robots.txt file has format errors
If there are any issues with blocked internal resources in robots.txt
Whether Sitemap.xml is indicated in robots.txt
If the robots.txt file exists

#TechnicalSEO #TechnicalSEOcourse #RobotsTXT #SEMrushAcademy
