What is Robots.txt? A Complete Guide for SEO and Crawlers
The robots.txt file is one of the smallest files on your website, but it can have a huge SEO impact. A single incorrect rule can block search engines from crawling important pages.
What is Robots.txt?
Robots.txt is a text file placed at the root of your website that gives crawl instructions to search engine bots and other automated agents. It uses the Robots Exclusion Protocol (REP) to define which URL paths are allowed or disallowed for specific user-agents.
Important: robots.txt controls crawling, not indexing. If a blocked URL is linked from other sites, search engines can still index it without its page content.
Where Robots.txt Lives
The file must be available at the exact root path:
```
https://yourdomain.com/robots.txt
```

If you place it anywhere else (for example, /files/robots.txt), crawlers may ignore it.
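The "exact root path" rule can be made concrete with a small sketch using Python's standard-library `urllib.parse`: no matter how deep a page sits, crawlers look at the root of its scheme + host. The function name and example domain are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin serving page_url.

    Crawlers always check the root of the scheme + host combination,
    regardless of how deep the page itself is.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://yourdomain.com/blog/post-1?ref=home"))
# https://yourdomain.com/robots.txt
```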
How Crawlers Read Robots.txt
- A crawler requests `/robots.txt` before crawling.
- It looks for rules matching its user-agent name.
- It applies `Allow` and `Disallow` directives by path.
- It crawls URLs that are allowed by the best matching rule.
Different bots may support directives differently, so always prioritize standards-supported syntax and test in search engine tools.
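The matching steps above can be sketched with Python's standard-library `urllib.robotparser` (the rules and domain here are illustrative; in practice you would load the real file with `set_url()` and `read()`). Note one such difference between implementations: the stdlib parser applies rules in file order (first match wins), so the `Allow` line is listed first here, whereas Google uses the most specific (longest) matching path:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; normally fetched via
# parser.set_url("https://yourdomain.com/robots.txt"); parser.read()
rules = [
    "User-agent: *",
    "Allow: /admin/help/",  # listed first: stdlib parser applies rules in file order
    "Disallow: /admin/",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://yourdomain.com/admin/"))       # False
print(parser.can_fetch("*", "https://yourdomain.com/admin/help/"))  # True
print(parser.can_fetch("*", "https://yourdomain.com/blog/"))        # True
```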
Robots.txt Syntax and Directives
The most common directives are:
- User-agent: Defines which crawler a rule block applies to.
- Disallow: Blocks crawling for matching paths.
- Allow: Explicitly allows a path, useful when a parent path is disallowed.
- Sitemap: Points crawlers to your XML sitemap URL.
```
User-agent: *
Disallow: /admin/
Allow: /admin/help/
Sitemap: https://yourdomain.com/sitemap.xml
```

Common Examples
1) Allow everything
```
User-agent: *
Disallow:
```

2) Block all crawling
```
User-agent: *
Disallow: /
```

3) Block internal folders
```
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /checkout/
```

Common Mistakes to Avoid
- Blocking the entire site in production after launching from a staging setup.
- Trying to hide sensitive data in robots.txt. It is publicly accessible and should never be treated as security.
- Blocking JS/CSS assets that search engines need for rendering.
- Using robots.txt instead of noindex when your goal is to keep pages out of search results.
- Forgetting to include a sitemap URL for faster discovery of important pages.
Robots.txt vs Noindex
| Method | Controls | Best For |
|---|---|---|
| robots.txt | Crawling | Reducing crawl load, blocking non-public sections |
| meta robots / X-Robots-Tag | Indexing behavior | Keeping specific pages out of search results |
Best Practices
- Keep the file simple and intentional. Avoid unnecessary wildcards and duplicate rules.
- Only disallow paths you truly do not want crawled.
- Test with search engine webmaster tools after each major change.
- Add a Sitemap directive; it can appear anywhere in the file, but the bottom is a common convention.
- Version-control your robots.txt changes so mistakes are easy to roll back.
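Testing and version control pair well with an automated check. Here is a hypothetical pre-deploy guard, again using the standard-library `urllib.robotparser`, that fails if a robots.txt change accidentally blocks pages that must stay crawlable (the path list and domain are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Pages that must never be blocked from crawling (illustrative list).
CRITICAL_PATHS = ["/", "/blog/", "/products/"]

def blocked_paths(robots_lines, user_agent="*", origin="https://yourdomain.com"):
    """Return the critical paths that the given rules would block."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [path for path in CRITICAL_PATHS
            if not parser.can_fetch(user_agent, origin + path)]

# A staging-style file that blocks everything trips the guard:
print(blocked_paths(["User-agent: *", "Disallow: /"]))        # ['/', '/blog/', '/products/']
# A targeted rule leaves the critical pages crawlable:
print(blocked_paths(["User-agent: *", "Disallow: /admin/"]))  # []
```

Running a check like this in CI catches the "blocked the entire site after launch" mistake before it reaches production.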
Generate a Robots.txt File
If you want to quickly build a valid robots.txt with common rule templates, use our free generator:
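Under the hood, a generator simply assembles directive lines from rule templates. A minimal sketch (the function name and field layout are illustrative, not the generator's actual implementation):

```python
def build_robots_txt(groups, sitemap=None):
    """Assemble robots.txt text from (user_agent, allows, disallows) groups."""
    lines = []
    for user_agent, allows, disallows in groups:
        lines.append(f"User-agent: {user_agent}")
        # Allow lines first, so parsers that match rules in file order
        # honor the exceptions before the broader Disallow.
        lines += [f"Allow: {path}" for path in allows]
        lines += [f"Disallow: {path}" for path in disallows]
        lines.append("")  # blank line between groups
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines) + "\n"

print(build_robots_txt(
    [("*", ["/admin/help/"], ["/admin/"])],
    sitemap="https://yourdomain.com/sitemap.xml",
))
```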
References
- Koster, M., Illyes, G., Zeller, H., & Sassman, L. (2022). RFC 9309: Robots Exclusion Protocol. IETF. https://datatracker.ietf.org/doc/html/rfc9309
- Google Search Central. How Google interprets the robots.txt specification. https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
- Google Search Central. Robots meta tag and X-Robots-Tag specifications. https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag