What Is robots.txt?

A robots.txt file is a plain text file placed at the root of your website (e.g., https://example.com/robots.txt) that tells web crawlers (search engine bots, AI scrapers, and other automated agents) which pages and directories on your site they may or may not crawl. It implements the Robots Exclusion Protocol (REP), a voluntary standard first proposed in 1994, formalized as RFC 9309 in 2022, and respected by legitimate search engine crawlers including Googlebot, Bingbot, and Yandex.

The file uses a simple directive-based format. Each block (group) starts with one or more User-agent lines naming the bots the rules apply to, followed by one or more Disallow or Allow directives. A wildcard User-agent: * applies rules to all bots not otherwise specified. The Sitemap: directive tells crawlers where to find your XML sitemap; it is independent of any User-agent block and is conventionally placed at the end of the file.
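As a rough illustration of the format described above, the sketch below parses a small robots.txt with Python's standard-library urllib.robotparser. The file contents, the ExampleBot name, and the example.com URLs are made up for the example. Note that this particular parser applies the first matching rule in a block, so the Allow line is placed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The wildcard block applies to any bot that has no block of its own.
home_ok = parser.can_fetch("ExampleBot", "https://example.com/")
private_ok = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
exception_ok = parser.can_fetch("ExampleBot", "https://example.com/private/public-page.html")
sitemaps = parser.site_maps()  # available in Python 3.8+

print(home_ok, private_ok, exception_ok, sitemaps)
```

Here the home page and the explicitly allowed page are crawlable, the rest of /private/ is not, and the sitemap URL is reported separately from the User-agent block.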

How to Block AI Bots with robots.txt

Since 2023, a wave of AI training crawlers has emerged that scrape website content to train large language models (LLMs). Unlike search engine bots that drive traffic to your site, AI training crawlers consume your content without any traffic benefit. Many website owners choose to block them via robots.txt. The major AI crawlers and their User-agent strings are:

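GPTBot: OpenAI's training crawler for ChatGPT and GPT models
CCBot: Common Crawl's bot, whose archive is used as training data for many open-source LLMs
ClaudeBot: Anthropic's training crawler (older block lists also name Claude-Web)
Google-Extended: a control token rather than a separate bot; Google's existing crawlers fetch the pages, and this token opts your content out of Gemini (formerly Bard) training without affecting search indexing
FacebookBot: Meta's crawler, used in part for AI training
Bytespider: ByteDance/TikTok's crawler
Applebot-Extended: a control token rather than a separate crawler; it opts content fetched by Applebot out of Apple's AI training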

To block these in robots.txt, add a User-agent block for each with Disallow: / (several User-agent lines can also share one block). Note that robots.txt is a voluntary standard, so malicious scrapers may ignore it. For stronger protection against scrapers, consider server-level rate limiting, CAPTCHA, or blocking known AI crawler IP ranges.
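The blocking rules can be generated rather than typed by hand. The following is a minimal sketch, assuming a short list containing a few of the tokens named above; the helper name build_ai_block_rules and the example.com URL are hypothetical. It writes one block per token and then sanity-checks the result with the standard-library parser.

```python
from urllib.robotparser import RobotFileParser

# A few of the user-agent tokens listed above; extend as needed.
AI_TOKENS = ["GPTBot", "CCBot", "Google-Extended", "Bytespider"]


def build_ai_block_rules(tokens):
    """Return robots.txt text with one 'Disallow: /' block per token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


rules = build_ai_block_rules(AI_TOKENS)
print(rules)

# Sanity check: the AI token is blocked, an unlisted search bot is not.
parser = RobotFileParser()
parser.parse(rules.splitlines())
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/blog/post")
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/blog/post")
```

Because no wildcard block is emitted, bots that are not listed remain unrestricted. Append the output to an existing robots.txt rather than replacing it.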

SEO Crawl Directives — Best Practices

Robots.txt is a powerful but blunt tool. Using it incorrectly can accidentally block Google from crawling important pages and destroy your search rankings. Here are the most important best practices:

Never block CSS, JavaScript, or image files from Googlebot. Google needs to render your pages to understand their content. Blocking resources prevents rendering and can hurt rankings significantly.
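Do not use robots.txt to hide sensitive pages. Disallow in robots.txt does not make a page private; it only prevents crawling. The URL may still appear in search results if other sites link to it. To keep a page out of search results, use a noindex meta tag and leave the page crawlable, because a crawler that is blocked from the page never sees the tag. For truly private content, use password protection.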
Block thin or duplicate content. Common paths to block: /wp-admin/ on WordPress, /cart and /checkout on e-commerce sites, search result pages (/?s=), printer-friendly versions of pages, and admin panels.
Always include your sitemap URL. Adding Sitemap: https://example.com/sitemap.xml in robots.txt helps all crawlers discover your sitemap without having to submit it manually to each search engine.
Test your robots.txt. Verify your directives before deploying, especially if using wildcards. Google Search Console's robots.txt report (which replaced the older robots.txt tester) shows the file Google fetched and any parsing errors.
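Several of the practices above can be checked automatically before a file goes live. The sketch below is one possible pre-deployment check, not a replacement for Search Console: the audit_robots helper, the draft file, and the URL lists are all hypothetical, and Python's urllib.robotparser does not implement the * and $ path wildcards and uses first-match rather than longest-match rule selection, so results for files that rely on those features may differ from Googlebot's behavior.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, must_allow, must_block, agent="Googlebot"):
    """Return a list of problems; an empty list means every check passed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for url in must_allow:
        if not parser.can_fetch(agent, url):
            problems.append(f"blocked but should be crawlable: {url}")
    for url in must_block:
        if parser.can_fetch(agent, url):
            problems.append(f"crawlable but should be blocked: {url}")
    return problems


# Hypothetical draft that accidentally blocks a stylesheet directory.
DRAFT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /cart
Disallow: /assets/
"""

problems = audit_robots(
    DRAFT,
    must_allow=["https://example.com/", "https://example.com/assets/site.css"],
    must_block=["https://example.com/cart", "https://example.com/wp-admin/"],
)
for problem in problems:
    print(problem)
```

The draft is flagged because the Disallow: /assets/ rule would stop Googlebot from fetching the stylesheet it needs for rendering.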

Common robots.txt Patterns

Different website platforms have well-known robots.txt patterns. WordPress sites typically block /wp-admin/ and /wp-login.php; older guides also block /wp-includes/, but that directory holds scripts and styles Google may need for rendering, so blocking it conflicts with the advice above on CSS and JavaScript. Shopify sites block /admin, /cart, /checkout, /orders, and search result pages. E-commerce sites should also block faceted navigation parameters that create thousands of duplicate URL variants, which waste crawl budget and dilute link equity.
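Faceted navigation rules usually depend on the * and $ wildcards, which RFC 9309 defines and major search engines support, but which Python's standard-library parser does not implement. The sketch below shows one way to approximate that matching for a single path pattern. The rule_matches helper, the /*?*sort= pattern, and the sample URLs are illustrative assumptions, and the sketch ignores percent-encoding normalization.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a URL path (with query string) against one robots.txt path pattern.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    Without '$', the pattern only needs to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' in place of each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.match(regex, path) is not None


# Hypothetical rule that targets any URL carrying a sort= parameter.
FACET_RULE = "/*?*sort="

print(rule_matches(FACET_RULE, "/shoes?color=red&sort=price"))
print(rule_matches(FACET_RULE, "/shoes"))
print(rule_matches("/*.pdf$", "/docs/guide.pdf"))
```

A pattern like this would sit after Disallow: in the wildcard block. Check it against real category URLs first, since an overly broad pattern can block pages you want crawled.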

Frequently Asked Questions

Does robots.txt guarantee bots won't visit my pages?

No. Robots.txt is a voluntary standard. Legitimate crawlers like Googlebot and Bingbot respect it, but malicious scrapers, spambots, and some AI training crawlers may ignore it entirely. For truly sensitive content, use server authentication, IP blocking, or CAPTCHAs rather than relying on robots.txt alone.
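Server-side enforcement can take many forms. As one minimal sketch, assuming a Python WSGI application, the middleware below answers 403 to requests whose User-Agent header contains a blocked token. The names block_bots, hello_app, and status_for and the token list are hypothetical, and because User-Agent headers are easy to spoof, this complements authentication or IP-based controls rather than replacing them.

```python
BLOCKED_TOKENS = ("gptbot", "ccbot", "bytespider")  # lowercase substrings, illustrative


def block_bots(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the User-Agent contains a blocked token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent):
    """Send a fake request through the wrapped app and return the status line."""
    seen = []
    wrapped = block_bots(hello_app)
    wrapped({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```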

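Can I have multiple User-agent blocks?

Yes. You can have as many User-agent blocks as you need. Each block applies to the bots it names. The wildcard User-agent: * applies to all bots not explicitly named in another block. A bot that matches a specific block follows only that block and ignores the wildcard block entirely; the two sets of rules are not merged, so repeat any wildcard rules you also want the named bot to obey.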
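The interaction between a named block and the wildcard block is easy to get wrong, so a small check can help. The sketch below uses Python's urllib.robotparser on a made-up file; the paths, the OtherBot name, and the example.com URLs are assumptions for illustration. It shows that a bot with its own block is not bound by the wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with one named block and one wildcard block.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

results = {
    (agent, path): parser.can_fetch(agent, "https://example.com" + path)
    for agent in ("Googlebot", "OtherBot")
    for path in ("/drafts/a.html", "/private/a.html")
}
for key, allowed in results.items():
    print(key, allowed)
```

Googlebot is kept out of /drafts/ but can still reach /private/, because only its own block applies to it; OtherBot falls back to the wildcard block and sees the opposite result.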

Will blocking content hurt my SEO?

Blocking crawlers from important content prevents search engines from reading that content, which will hurt rankings for those pages. Only block pages you genuinely don't want crawled: admin areas, duplicate pages, internal search results, and staging content. Never block your main content, product pages, blog posts, or any page you want to rank in search results.

How often do bots re-read robots.txt?

Googlebot generally caches robots.txt for up to 24 hours and re-reads it periodically. Other bots vary. If you change your robots.txt, it may take up to 24 hours for Googlebot to apply the new rules. If you need faster propagation, you can request a fresh fetch of robots.txt in Google Search Console.

What is the Crawl-delay directive?

The Crawl-delay directive asks bots to wait a specified number of seconds between requests to reduce server load. For example, Crawl-delay: 10 asks bots to wait 10 seconds between page requests. It is a nonstandard extension that is not part of RFC 9309, and support varies: Bing and some other bots respect it, but Googlebot ignores it. Google sets its crawl rate automatically and has retired the crawl rate setting in Search Console; to slow Googlebot temporarily, return 429 or 503 responses.
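From a crawler's side, honoring Crawl-delay means reading the value for its own user agent and pausing between requests. The sketch below shows the lookup with Python's urllib.robotparser. The file contents, the delay_for helper, and the one-second fallback are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: only bingbot is given an explicit delay.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /admin/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def delay_for(agent, default=1.0):
    """Seconds to wait between requests: the Crawl-delay value if set, else a default."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


print(delay_for("bingbot"))
print(delay_for("ExampleBot"))
```

A polite crawler would pass the returned value to time.sleep between page requests.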

Should I block GPTBot and other AI crawlers?

This is a personal and business decision. Blocking AI training crawlers prevents your content from being used to train LLMs without compensation or attribution, but has no effect on search rankings. Some publishers choose to block all AI training bots on principle; others prefer to allow them hoping for future traffic from AI-powered search features. OpenAI, Anthropic, and Google have all publicly committed to respecting robots.txt for their training crawlers.
