What Is Robots.txt?
Robots.txt is a plain text file placed at the root of a website (example.com/robots.txt) that tells web crawlers which URLs they are allowed to access. It follows the Robots Exclusion Protocol (REP), first introduced in 1994 and formalized as RFC 9309 in 2022. The file uses simple directive-value pairs: User-agent identifies the crawler, Disallow blocks specific paths, Allow creates exceptions, and Sitemap points to XML sitemaps. Every major search engine (Google, Bing, Yahoo, Yandex, and Baidu) reads and respects robots.txt, but compliance is voluntary: the file is a request to crawlers, not an access control.
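To make the directive-value pairs concrete, here is a minimal sketch that parses a small robots.txt with Python's standard urllib.robotparser module. The file content, the ExampleBot name, and the example.com URLs are assumptions made up for illustration. Note that Python's parser applies the first rule that matches, whereas RFC 9309 specifies that the most specific (longest) match wins, so the Allow exception is listed before the broader Disallow to get the same answer under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com (illustration only).
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Disallow blocks the path, Allow carves out an exception.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/secret"))       # False
print(parser.can_fetch("ExampleBot", "https://example.com/admin/public/page"))  # True
print(parser.can_fetch("ExampleBot", "https://example.com/blog/"))              # True

# Sitemap directives are exposed separately (Python 3.8+).
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Paths that match no rule, such as /blog/ here, are treated as allowed.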
Common Robots.txt Errors
The most frequent robots.txt mistakes are: placing Allow or Disallow directives before any User-agent declaration (the rules belong to no group, so crawlers cannot tell which bot they apply to); using relative Sitemap URLs instead of absolute URLs (Sitemap: /sitemap.xml should be Sitemap: https://example.com/sitemap.xml); blocking CSS and JavaScript files that search engines need for rendering (Disallow: /css/ or Disallow: /js/ can keep a crawler from seeing the page the way users do); having no User-agent: * catch-all block (crawlers not named in any group receive no instructions and treat the whole site as crawlable); and using an empty Disallow without understanding that it means 'allow everything.' Each of these errors can silently degrade your site's search performance.
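The mistakes listed above are mechanical enough to check automatically. The following is a rough sketch of such a check, not a complete validator: the function name lint_robots_txt and the warning wording are hypothetical, and the CSS/JavaScript check only recognizes the literal /css and /js paths used as examples in this section.

```python
from urllib.parse import urlparse


def lint_robots_txt(text: str) -> list[str]:
    """Return warnings for a few common robots.txt mistakes."""
    warnings = []
    seen_user_agent = False
    has_catch_all = False

    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            seen_user_agent = True
            if value == "*":
                has_catch_all = True
        elif field in ("allow", "disallow"):
            if not seen_user_agent:
                warnings.append(f"line {number}: {field} appears before any User-agent")
            if field == "disallow" and value == "":
                warnings.append(f"line {number}: empty Disallow allows everything")
            if field == "disallow" and value.rstrip("/").lower() in ("/css", "/js"):
                warnings.append(f"line {number}: blocks resources needed for rendering")
        elif field == "sitemap":
            parsed = urlparse(value)
            if not (parsed.scheme and parsed.netloc):
                warnings.append(f"line {number}: Sitemap URL is not absolute")

    if not has_catch_all:
        warnings.append("no User-agent: * catch-all group")
    return warnings


# A deliberately broken file that triggers every check above.
BAD = """\
Disallow: /private/
User-agent: Googlebot
Disallow: /css/
Disallow:
Sitemap: /sitemap.xml
"""

for warning in lint_robots_txt(BAD):
    print(warning)
```

An empty Disallow is sometimes intentional, so that warning is best read as a prompt to double-check rather than as an error.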
Robots.txt Best Practices for SEO
Start every robots.txt with a User-agent: * block that applies to all crawlers, then add specific blocks for individual bots that need different rules. Always include at least one Sitemap directive pointing to your XML sitemap's full URL. Never use robots.txt to hide sensitive content: the file is publicly accessible and provides no security. Instead, protect private content with authentication, and use noindex meta tags for pages that should stay crawlable but out of search results. Keep the file under 500 KiB, the limit Google documents. Test changes with a robots.txt parser before deploying, then confirm in Google Search Console's robots.txt report (which replaced the older robots.txt tester) that the live file was fetched and parsed without errors. Review the file quarterly to ensure rules match your current site structure.
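Some of these practices can be turned into a pre-deployment check. The sketch below assumes a hypothetical predeploy_check helper that takes the candidate file's text and a mapping of URLs to whether they should be crawlable. It uses Python's urllib.robotparser, whose rule matching is simpler than Google's (first match, no wildcard support), so treat a pass as a sanity check rather than a guarantee of how any particular search engine will behave.

```python
from urllib.robotparser import RobotFileParser

MAX_BYTES = 500 * 1024  # size limit documented by Google (500 KiB)


def predeploy_check(text: str, expectations: dict[str, bool],
                    user_agent: str = "ExampleBot") -> list[str]:
    """Return a list of problems found in a candidate robots.txt.

    expectations maps a URL to True (should be crawlable) or False.
    """
    problems = []

    size = len(text.encode("utf-8"))
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_BYTES}-byte limit")

    parser = RobotFileParser()
    parser.parse(text.splitlines())

    if not parser.site_maps():  # None when no Sitemap directive exists
        problems.append("no Sitemap directive found")

    for url, should_allow in expectations.items():
        if parser.can_fetch(user_agent, url) != should_allow:
            wanted = "allowed" if should_allow else "blocked"
            problems.append(f"{url} should be {wanted}")

    return problems


# Hypothetical candidate file and expectations for example.com.
CANDIDATE = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

EXPECTED = {
    "https://example.com/": True,
    "https://example.com/admin/": False,
}

print(predeploy_check(CANDIDATE, EXPECTED))  # [] when every check passes
```

Keeping the expectations next to the file in version control also gives the quarterly review a concrete list to revisit.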
Robots.txt vs Noindex vs Nofollow
Robots.txt, noindex, and nofollow serve different purposes and are not interchangeable. Robots.txt blocks crawling: compliant crawlers won't even fetch the page, but that alone does not guarantee the URL stays out of the index. The noindex meta tag or X-Robots-Tag header tells crawlers to fetch the page but not add it to the search index. The nofollow attribute tells crawlers not to follow specific links or pass link equity. A critical mistake is using robots.txt to block pages that have noindex tags: if crawlers can't access the page, they can't see the noindex directive, and the URL may remain indexed because of external links pointing to it.





