
What is Robots.txt? A Complete Guide for SEO and Crawlers

The robots.txt file is one of the smallest files on your website, but it can have a huge SEO impact. A single incorrect rule can block search engines from crawling important pages.

What is Robots.txt?

Robots.txt is a text file placed at the root of your website that gives crawl instructions to search engine bots and other automated agents. It uses the Robots Exclusion Protocol (REP) to define which URL paths are allowed or disallowed for specific user-agents.

Important: robots.txt controls crawling, not indexing. If a blocked URL is linked from other sites, search engines can still index it without crawling its content, so it may appear in results with no description.

Where Robots.txt Lives

The file must be available at the exact root path:

https://yourdomain.com/robots.txt

If you place it anywhere else (for example, /files/robots.txt), crawlers may ignore it.
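Because the location depends only on the scheme and host, you can derive it from any page URL. A minimal sketch using Python's standard library (the function name is just for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # robots.txt must sit at the root of the host; the page path is irrelevant
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://yourdomain.com/files/page.html"))
# https://yourdomain.com/robots.txt
```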

How Crawlers Read Robots.txt

  1. A crawler requests /robots.txt before crawling.
  2. It looks for rules matching its user-agent name.
  3. It applies Allow and Disallow directives by path.
  4. It crawls URLs permitted by the most specific matching rule; for major engines like Google, the longest matching path wins when Allow and Disallow overlap.

Different bots may support directives differently, so always prioritize standards-supported syntax and test in search engine tools.
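You can simulate this lookup with Python's built-in parser. Note one such difference: urllib.robotparser applies rules in file order rather than by longest match, which is why the Allow line is placed before the broader Disallow below (the domain is a placeholder):

```python
from urllib import robotparser

rules = """\
User-agent: *
Allow: /admin/help/
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://yourdomain.com/admin/help/faq"))  # True
print(rp.can_fetch("*", "https://yourdomain.com/admin/users"))     # False
print(rp.can_fetch("*", "https://yourdomain.com/blog/post"))       # True: no rule matches
```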

Robots.txt Syntax and Directives

The most common directives are:

  • User-agent: Defines which crawler a rule block applies to.
  • Disallow: Blocks crawling for matching paths.
  • Allow: Explicitly allows a path, useful when a parent path is disallowed.
  • Sitemap: Points crawlers to your XML sitemap URL.

Put together, a minimal file looks like this:

User-agent: *
Disallow: /admin/
Allow: /admin/help/

Sitemap: https://yourdomain.com/sitemap.xml
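On Python 3.8+, the standard-library parser can also surface Sitemap lines via site_maps(), shown here with the example file above (the URL is a placeholder):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/
Allow: /admin/help/

Sitemap: https://yourdomain.com/sitemap.xml
""".splitlines())

print(rp.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```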

Common Examples

1) Allow everything

User-agent: *
Disallow:

2) Block all crawling

User-agent: *
Disallow: /

3) Block internal folders

User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /checkout/

Common Mistakes to Avoid

  • Blocking the entire site in production after launching from a staging setup.
  • Trying to hide sensitive data in robots.txt. It is publicly accessible and should never be treated as security.
  • Blocking JS/CSS assets that search engines need for rendering.
  • Using robots.txt instead of noindex when your goal is to keep pages out of search results.
  • Forgetting to include a sitemap URL for faster discovery of important pages.
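The first mistake above, shipping a staging "Disallow: /" to production, is easy to catch automatically. A minimal deploy-time guard, sketched with the standard-library parser (example.com is a placeholder):

```python
from urllib import robotparser

def site_fully_blocked(robots_txt: str) -> bool:
    """True if the wildcard agent cannot crawl the site root."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch("*", "https://example.com/")

# A leftover staging file should fail the deploy check:
assert site_fully_blocked("User-agent: *\nDisallow: /")
# A permissive production file should pass:
assert not site_fully_blocked("User-agent: *\nDisallow:")
```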

Robots.txt vs Noindex

Method                      | Controls          | Best For
robots.txt                  | Crawling          | Reducing crawl load, blocking non-public sections
meta robots / x-robots-tag  | Indexing behavior | Keeping specific pages out of search results
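For the indexing-side control, the standard mechanism is a meta robots tag; crawling must remain allowed so engines can actually see it:

```html
<!-- In the page <head>: allow crawling, but keep the page out of results -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the X-Robots-Tag: noindex HTTP response header. Note that a page blocked by robots.txt cannot have its noindex seen, so do not combine the two on the same URL.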

Best Practices

  • Keep the file simple and intentional. Avoid unnecessary wildcards and duplicate rules.
  • Only disallow paths you truly do not want crawled.
  • Test with search engine webmaster tools after each major change.
  • Add your sitemap directive at the bottom of the file.
  • Version-control your robots.txt changes so mistakes are easy to roll back.

Generate a Robots.txt File

If you want to quickly build a valid robots.txt with common rule templates, use our free generator.
