Differences between 'noindex' and robots.txt in SEO: use them correctly!
If you have reached this post, you probably already know what de-indexing is and why you want to undertake it on some of your pages.
The objective of this post is to clearly differentiate between robots.txt and the 'noindex' tag, to call each thing by its name and avoid confusion.
Main objective: de-indexing
Most likely, you have doubts about which method to use to achieve it: you know both exist, but not what each one implies, which is better, or whether you can use them simultaneously…
Don't worry, the final objective will be achieved: de-indexing. Now I'm going to explain which of the two methods suits you best in each case, and the differences between them.
What is Robots.txt?
The robots.txt file is a text file that you should store at the root of your website. This file is used to give orders to the different search engines in order to block or allow the access of spiders or bots to a URL, a directory, or the whole site. It is understood that a page blocked by robots.txt should not be crawlable, because a bot cannot access it, and therefore it should not be indexable.
When we include a line in the robots file with a disallow: followed by a URL or a directory, we are telling a search engine that we don't want its bots to access that place. It is a door completely closed to crawling, although with caveats.
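For example, a minimal robots.txt sketch (the directory name is hypothetical):

```text
# Applies to all bots
User-agent: *
# Block crawling of everything under /private/
Disallow: /private/
```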
What do we achieve with this 'disallow'?
If a certain page is marked with disallow, GoogleBot will not access it and consequently cannot store or analyze it, which has the effect of de-indexing it, or directly of preventing it from ever being indexed. The distinction is easy: we de-index when a page has been indexed at some point in its life; if a page is 'born' directly under disallow, it will never (or should never) be indexed.
If you have understood these last paragraphs, you will understand why I like to say that robots.txt is not a tool created to de-index. What? Yes, strictly speaking, and being more papist than the Pope, that is so. It is a tool that lets you choose which parts of a website are crawled by bots; the non-indexing of those pages is only one of the consequences of disallow.
How to use Robots.txt?
The robots.txt file can be accessed in several ways, as long as it has been generated, of course.
If you use WordPress as your CMS, you can edit robots.txt easily with the Yoast SEO plugin. Here is a complete Yoast SEO tutorial.
Within the plugin configuration, go to Tools > File editor. Inside you will see the lines that make up the robots.txt.
From here you can directly add the disallow rules you want. Type manually:

Disallow: /url/

This way you will be blocking search engines' access to that URL.
Remember that you can choose which search engines you give the directives to using the 'User-agent' line.
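For instance, a sketch that gives different instructions to different bots (all paths are hypothetical):

```text
# Rules for Google only
User-agent: Googlebot
Disallow: /testing/

# Rules for every other bot
User-agent: *
Disallow: /testing/
Disallow: /drafts/
```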
Another way to access robots.txt is via FTP or even, depending on the hosting, through cPanel. The file will be at the root of the web, and from there you can edit it. Add the lines and disallow rules as necessary and save the changes.
Important: you should only make the changes in one of the two places. If you apply the changes through a plugin like Yoast, for example, it won't be necessary to do it through FTP: it is updated in both places.
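If you want to verify from the outside which URLs a robots.txt actually blocks, Python's standard library can parse one. A small sketch, where the file content and example.com URLs are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() reports whether a given user agent may crawl a URL
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))     # True
```

This only checks crawl permission, which, as explained above, is not the same thing as indexing.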
What is the meta tag 'noindex'?
The so-called 'meta robots' are HTML tags that can be included on the pages of a website.
These tags tell Google and other engines how they should proceed with that page regarding crawling and indexing. They are used to establish which URLs are or are not indexed in search engines.
Clarification: not all pages must contain the robots meta tag. If a page does not have it, it will be understood that the page is index, follow.
Within this tag we can include up to four different combinations, depending on what you need and how you want the bots to behave:
- Index, follow: with this combination we indicate that the page is indexable and we want the links it contains to be followed.
- Index, nofollow: the page is indexable, but we do not want the links on that page to be followed.
- Noindex, follow: the most common combination for de-indexing pages. We indicate that the page should not be indexed, but we do want the links to be followed.
- Noindex, nofollow: in this way we indicate that we do not want the page to be indexed, nor its links to be followed.
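As HTML, these four combinations look like this (only one of them would go on any given page):

```html
<meta name="robots" content="index, follow">
<meta name="robots" content="index, nofollow">
<meta name="robots" content="noindex, follow">
<meta name="robots" content="noindex, nofollow">
```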
How to use it
Depending on the website you have, how it is built, and the CMS you use, the way you apply it may change. What never changes is the tag in question:
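For the most common de-indexing case, the tag placed inside the page's head looks like this (the title is a placeholder):

```html
<head>
  <title>Page you do not want indexed</title>
  <!-- Tells bots not to index the page, but to follow its links -->
  <meta name="robots" content="noindex, follow">
</head>
```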
This tag must be in the header of each page to which you want to give these indications.
If you have WordPress and Yoast SEO, you just have to toggle a few buttons to index or noindex, follow or nofollow, the pages, content types, taxonomies, or files you want. You can see this step by step in the Yoast SEO guide. I also leave you here a tutorial to correctly de-index with 'noindex'.
Differences between Robots.txt and noindex
Although the final objective and the result are the same, de-indexing, doing it one way or the other has certain differences that should not be overlooked, and that should make you think about when to use each one and for which cases.
The biggest difference between the two lies in crawling!
| Robots.txt | Meta robots [noindex, follow] |
| --- | --- |
| The bot does not crawl the page | The bot does crawl the page |
| It does not follow the links nor transmit authority | It follows the links and transmits authority |
| Content visible to users | Content visible to users |
| It does not index * | It does not index |
The big difference between robots.txt and the meta robots noindex, when it is 'noindex, follow', resides in crawling.
A URL in disallow will not be crawled at all; that is, the bot does not waste time crawling its content. In the case of noindex, GoogleBot will access the content and, among other things in that crawl, it will see the meta name robots tag.
Another very important difference is the matter of links. With noindex, follow, the links on that page will be followed by the bots and will transmit authority, unless a specific link carries another attribute, rel="nofollow". This is the great advantage of the meta robots tag, since it allows you to de-index a page without giving up the crawling of the links it contains. Something really useful, especially for internal linking.
For example, the category pages of my blog are 'noindex, follow', since I don't want to index the category pages, but I do want Google to discover and crawl the internal links to articles that I do want to index.
The disadvantage of the meta name robots tag is precisely that crawl the bot has to perform. If you think about it, we are telling GoogleBot to spend time and resources crawling a page that we are not going to index. We understand that this is detrimental to the so-called crawl budget.
In the table I have marked "does not index" for Robots.txt with an asterisk, because there are certain occasions when a page blocked with disallow may not be de-indexed.
Specific case: I have seen it on several occasions. If we mark a URL or directory with "noindex" in the meta name robots tag and also block that URL or directory with a Disallow in robots.txt, the bots cannot access those URLs and therefore will never see the noindex tag.
What does this imply? Sometimes Google ends up not de-indexing those pages, despite the noindex (which it cannot see) and despite the block in robots.txt.
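The conflict looks like this: the page carries the noindex tag, but the disallow in robots.txt stops the bot before it can ever read it (paths are hypothetical):

```text
# robots.txt — this block stops the bot at the door
User-agent: *
Disallow: /old-section/
```

```html
<!-- /old-section/page.html — the bot never gets to see this tag -->
<meta name="robots" content="noindex, follow">
```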
When this happens, in Search Console's Coverage report we see in Warnings some URLs marked as "Indexed, though blocked by robots.txt". It is also convenient to look for this type of inconsistency within "Valid", where "Indexed, not submitted in sitemap" may appear.
This is inconsistent for several reasons. First, because the URLs that we include in the sitemap should be the ones we want indexed. If we have indexed pages that are not in the sitemap, we have to review it: either they have not been included, or pages that should not be indexed are being indexed.
If you use WordPress and Yoast SEO, this is easy to understand: when you mark a URL with "noindex" in Yoast, it is automatically removed from the sitemap. It makes no sense for a page that has "noindex" and is not in the sitemap to be indexed, so there may be a conflict like the one we have just seen.
Which one do I use in each case?
Now that you understand what robots.txt and noindex are for, what each implies, and their differences, you should think and decide which one to use in each case.
Is it worth it for GoogleBot to spend time crawling a URL that you don't want indexed? You will find the answer, above all, in the internal links. If the page you want to de-index contains internal links that are useful for your strategy, pointing to pages you want to rank, it will surely be worth it.
We must understand that the correct way to indicate to Google that a page should not be indexed is with "noindex" in the robots meta tag.
When do we use robots.txt? We should use it for purposes other than just de-indexing. We should block via robots.txt, above all, those parts of a website that are not only irrelevant to the user, but that we do not want a search engine to access under any circumstances.
There are certain pages that you could put into robots.txt, since they usually have no SEO utility:
- Legal notice
- Purchase conditions
- Private access
- Cart for an e-commerce
- Pages that have already been de-indexed, which carry noindex, and for which you now want to avoid crawling
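A robots.txt covering pages like those above might look like this (all the paths are hypothetical and depend on how your site is built):

```text
User-agent: *
Disallow: /legal-notice/
Disallow: /purchase-conditions/
Disallow: /private-access/
Disallow: /cart/
```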
Is the difference between robots.txt and noindex clear to you now? If you have any questions, leave a comment and I will reply as soon as possible.