Google Announces End of Support for Robots.txt Noindex Directives
Google announced early this month that they will be discontinuing their unofficial support of noindex directives within robots.txt files. As of 1 September 2019, publishers who are currently relying on robots.txt crawl-delay, nofollow, and noindex directives will need to find another way to instruct search engine robots how to crawl their sites’ pages.
For many web publishers, this policy update means making changes quickly to preserve their approach to search engine optimisation. Are you one of them? Continue reading to learn more about Google’s major policy update and what it will mean moving forward.
What are robots.txt files?
Webmasters create robots.txt text files to tell user agents (web crawlers) how to crawl the pages on their websites. These text files either allow or disallow specified behaviour by web robots such as search engine crawlers.
Robots.txt files can be as short as two lines. The first line specifies the user agent to which the rules apply. The second gives that user agent a specific instruction, such as allow or disallow. A web crawler will disregard rules that are not addressed to it, and some crawlers ignore rules that are. From 1 September 2019, Google will ignore noindex, nofollow, and crawl-delay rules in robots.txt, while continuing to follow allow and disallow.
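A minimal sketch of such a two-line file, checked with Python's standard-library urllib.robotparser. The /private/ path, the example.com URLs, and the is_allowed helper are illustrative assumptions, and Python's parser is not Google's, so treat the result as an approximation of how a compliant crawler reads the rules.

```python
from urllib.robotparser import RobotFileParser

# A two-line robots.txt: line 1 names the user agent, line 2 gives the rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/private/page.html"))  # False
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/"))  # True
```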
Google has long discouraged publishers from using crawl-delay, nofollow, and noindex directives within robots.txt files, but has followed most of them despite having no standardised policy toward them.
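To find out whether a site still relies on these directives, its robots.txt can be scanned for them. The sketch below is one possible approach, not an official tool: the find_unsupported_rules helper and the sample file are hypothetical, and the list of field names is taken from the three directives named above.

```python
# Directives named in this article as losing Google's support in robots.txt.
UNSUPPORTED = ("noindex", "nofollow", "crawl-delay")

# Hypothetical sample file used only for illustration.
SAMPLE = """\
User-agent: *
Disallow: /private/
Noindex: /drafts/
Crawl-delay: 10
"""


def find_unsupported_rules(robots_txt: str) -> list:
    """Return (line_number, rule) pairs for lines that use an unsupported directive."""
    hits = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue  # blank line or not a "field: value" rule
        field = line.split(":", 1)[0].strip().lower()
        if field in UNSUPPORTED:
            hits.append((number, line))
    return hits


for number, rule in find_unsupported_rules(SAMPLE):
    print(f"line {number}: {rule}")
```

Any line this reports is a rule that will need one of the alternatives described later in the article.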
Why is Google ditching robots.txt noindex now?
Google has spent years working to standardise the robots exclusion protocol, and dropping undocumented rules is part of that effort. This is also why Google has long encouraged publishers to find alternatives to these robots.txt directives.
In their announcement, Google said they were working to make the Robots Exclusion Protocol (REP) an internet standard. As part of that effort, they have open-sourced the C++ library they use to parse and match rules in robots.txt files. The 20-plus-year-old library, along with a testing tool offered by Google, can help developers create the parsing tools of the future.
How can you keep controlling crawling on your site?
Google’s decision to disregard robots.txt noindex does not leave publishers without means to control crawling and indexing on their sites.
Use robots meta tags to noindex
The noindex directive is supported both in HTML, as a robots meta tag, and in HTTP response headers, as an X-Robots-Tag header.
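As a rough illustration, the small WSGI application below sends noindex both ways: as a robots meta tag in the HTML and as an X-Robots-Tag response header. Either one is enough on its own; both appear here only to show the syntax. The noindex_app name and the page content are assumptions made for this sketch. Note that the page must remain crawlable, because a crawler that is blocked from fetching it never sees the noindex.

```python
# Hypothetical page that should stay out of search results.
PAGE = b"""<!DOCTYPE html>
<html>
  <head>
    <meta name="robots" content="noindex">
    <title>Internal draft</title>
  </head>
  <body>Not for search results.</body>
</html>
"""


def noindex_app(environ, start_response):
    """WSGI app that marks its response as noindex in the header and the HTML."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("X-Robots-Tag", "noindex"),  # header form, also usable for PDFs and images
    ]
    start_response("200 OK", headers)
    return [PAGE]
```

To try it locally, the app can be served with the standard library, for example wsgiref.simple_server.make_server("", 8000, noindex_app).serve_forever().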
Use HTTP status codes 404 and 410
These status codes tell crawlers that the page does not exist, so it is dropped from Google’s index once the URL has been crawled and processed.
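A minimal sketch of how a site might return these codes, again as a WSGI application. The path sets and the status_app name are hypothetical: pages removed on purpose answer 410 (Gone), and unknown URLs answer 404 (Not Found).

```python
from http import HTTPStatus

# Hypothetical site content, for illustration only.
LIVE_PAGES = {"/": b"Home", "/blog": b"Blog"}
REMOVED_PATHS = {"/old-campaign", "/discontinued-product"}


def status_app(environ, start_response):
    """Serve live pages, answer 410 for removed pages and 404 for everything else."""
    path = environ.get("PATH_INFO", "/")
    if path in LIVE_PAGES:
        status, body = HTTPStatus.OK, LIVE_PAGES[path]
    elif path in REMOVED_PATHS:
        status, body = HTTPStatus.GONE, b"This page has been permanently removed."
    else:
        status, body = HTTPStatus.NOT_FOUND, b"Page not found."
    start_response(
        f"{status.value} {status.phrase}",
        [("Content-Type", "text/plain; charset=utf-8")],
    )
    return [body]
```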
Hide content behind password protections
Unless you’ve signalled subscription or paywalled content with markup, hiding content behind a login page will generally remove it from Google’s index.
Disallow in robots.txt
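If search engines are disallowed from crawling a page, its content usually will not be indexed. The URL itself may still be indexed if other pages link to it, but without the page’s content.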
Dallas has over a decade of local and overseas experience in digital marketing & web development. He previously co-founded a web & mobile app development company and holds multiple industry-leading SEO certifications.
Dallas is interested in the intersection between digital media & business. He holds a Bachelor of Arts degree from the University of Auckland - a Film, TV & Media Studies major, and a Diploma in Business from Auckland University of Technology.