Free Robots.txt Generator
Generate a robots.txt file for your website in seconds. Block AI scrapers, configure crawl rules, and add your sitemap URL.
What Is robots.txt?
A robots.txt file is a plain text file placed at the root of your website (e.g., https://example.com/robots.txt) that tells web crawlers (search engine bots, AI scrapers, and other automated agents) which pages and directories on your site they are allowed or not allowed to access. It implements the Robots Exclusion Protocol (REP), a voluntary standard first proposed in 1994 and still universally respected by legitimate search engine crawlers including Googlebot, Bingbot, and Yandex.
The file uses a simple directive-based format. Each block starts with a User-agent line specifying which bot the rules apply to, followed by one or more Disallow or Allow directives. A wildcard User-agent: * applies rules to all bots not otherwise specified. The Sitemap: directive at the end tells crawlers where to find your XML sitemap.
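As a rough illustration of the format described above, the sketch below parses a small robots.txt with Python's standard-library urllib.robotparser. The bot names ExampleBot and SomeOtherBot, the /private/ path, and the example.com URLs are placeholders chosen for this example, not recommendations.

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt: a wildcard block, one bot-specific block, and a sitemap.
# "ExampleBot" and "/private/" are hypothetical names used for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Bots without their own block fall back to the wildcard rules.
any_bot_public = parser.can_fetch("SomeOtherBot", "https://example.com/blog/")
any_bot_private = parser.can_fetch("SomeOtherBot", "https://example.com/private/page")

# ExampleBot has its own block, which disallows everything.
example_bot_public = parser.can_fetch("ExampleBot", "https://example.com/blog/")

# Sitemap lines are collected separately from the per-bot rules.
sitemaps = parser.site_maps()

print(any_bot_public, any_bot_private, example_bot_public, sitemaps)
```

Here the wildcard block only restricts /private/, while the named block shuts ExampleBot out of the whole site.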
How to Block AI Bots with robots.txt
Since 2023, a wave of AI training crawlers has emerged that scrape website content to train large language models (LLMs). Unlike search engine bots that drive traffic to your site, AI training crawlers consume your content without any traffic benefit. Many website owners choose to block them via robots.txt. The major AI crawlers and their User-agent strings are:
- GPTBot: OpenAI's training crawler for ChatGPT and GPT models
- CCBot: Common Crawl bot, used as training data for many open-source LLMs
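- Claude-Web: Anthropic's training crawler (Anthropic's current documentation lists ClaudeBot as its training crawler, so consider listing both)
- Google-Extended: Google's token for controlling Bard/Gemini training use (separate from search indexing); it is a robots.txt control token rather than a distinct crawler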
- FacebookBot: Meta's crawler, used in part for AI training
- Bytespider: ByteDance/TikTok's crawler
- Applebot-Extended: Apple's extended crawler for AI training
To block these in robots.txt, add a separate User-agent block for each with Disallow: /. Note that robots.txt is a voluntary standard; malicious scrapers may ignore it. For stronger protection against scrapers, consider server-level rate limiting, CAPTCHA, or blocking known AI crawler IP ranges.
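One way to produce those blocks is sketched below, using the user-agent strings from the list above. The helper name build_ai_block_rules is made up for this example, and the check at the end relies on Python's standard-library parser, which may not behave exactly like every real crawler.

```python
from urllib.robotparser import RobotFileParser

# User-agent strings taken from the list above; extend as new crawlers appear.
AI_CRAWLERS = [
    "GPTBot",
    "CCBot",
    "Claude-Web",
    "Google-Extended",
    "FacebookBot",
    "Bytespider",
    "Applebot-Extended",
]


def build_ai_block_rules(user_agents):
    """Return robots.txt text with one 'Disallow: /' block per user agent."""
    blocks = [f"User-agent: {ua}\nDisallow: /" for ua in user_agents]
    return "\n\n".join(blocks) + "\n"


robots_txt = build_ai_block_rules(AI_CRAWLERS)

# Sanity check: a listed AI crawler is blocked, an unlisted search bot is not.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/article")
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/article")

print(robots_txt)
```

Because no wildcard block is generated, crawlers that are not in the list stay unrestricted, which is usually what you want for search engines.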
SEO Crawl Directives: Best Practices
Robots.txt is a powerful but blunt tool. Using it incorrectly can accidentally block Google from crawling important pages and destroy your search rankings. Here are the most important best practices:
- Never block CSS, JavaScript, or image files from Googlebot. Google needs to render your pages to understand their content. Blocking resources prevents rendering and can hurt rankings significantly.
- Do not use robots.txt to hide sensitive pages. Disallow in robots.txt does not make a page private; it just prevents crawling. The URL may still appear in search results if other sites link to it. Use noindex meta tags or password protection for truly private content. Keep in mind that a crawler can only see a noindex tag on a page it is allowed to fetch, so do not combine noindex with a Disallow rule for the same URL.
- Block thin or duplicate content. Common paths to block: /wp-admin/ on WordPress, /cart and /checkout on e-commerce sites, search result pages (/?s=), printer-friendly versions of pages, and admin panels.
- Always include your sitemap URL. Adding Sitemap: https://example.com/sitemap.xml in robots.txt helps all crawlers discover your sitemap without having to submit it manually to each search engine.
- Test your robots.txt. Verify your directives with a robots.txt testing tool before deploying, especially if using wildcards. Google Search Console's standalone robots.txt tester has been retired in favor of its robots.txt report, which shows how Google fetched and parsed your live file.
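A draft file can also be checked locally before it goes live. The following is a minimal sketch, assuming a hypothetical draft and a hand-written table of URLs with the access you expect for each; check_rules is a name invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft robots.txt to verify before deployment.
DRAFT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /cart
Disallow: /checkout

Sitemap: https://example.com/sitemap.xml
"""

# URL -> whether crawling should be allowed. Adjust to your own site.
EXPECTATIONS = {
    "https://example.com/": True,
    "https://example.com/assets/site.css": True,  # keep CSS/JS crawlable
    "https://example.com/wp-admin/options.php": False,
    "https://example.com/cart": False,
    "https://example.com/checkout/step-1": False,
}


def check_rules(robots_txt, expectations, user_agent="Googlebot"):
    """Return (url, expected, actual) tuples for every mismatch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for url, expected in expectations.items():
        actual = parser.can_fetch(user_agent, url)
        if actual != expected:
            failures.append((url, expected, actual))
    return failures


failures = check_rules(DRAFT, EXPECTATIONS)
print("all checks passed" if not failures else failures)
```

Two caveats: rules match by path prefix, so Disallow: /cart would also block a path such as /cartoons, and at the time of writing Python's parser does not appear to interpret the * and $ wildcards that Google supports, so wildcard rules should still be verified against the search engine's own tooling.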
Common robots.txt Patterns
Different website platforms have well-known robots.txt patterns. WordPress sites typically block /wp-admin/, /wp-includes/, and /wp-login.php. Shopify sites block /admin, /cart, /checkout, /orders, and search result pages. E-commerce sites should also block faceted navigation parameters that create thousands of duplicate URL variants, which waste crawl budget and dilute link equity.
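The platform patterns above could be captured as simple presets in a generator. This sketch uses only the paths named in the paragraph; the PRESETS table and render_preset function are illustrative names, and real sites will likely need extra rules (for example, for search result pages or faceted navigation parameters).

```python
# Disallow paths per platform, taken from the paragraph above.
PRESETS = {
    "wordpress": ["/wp-admin/", "/wp-includes/", "/wp-login.php"],
    "shopify": ["/admin", "/cart", "/checkout", "/orders"],
}


def render_preset(platform, sitemap_url=None):
    """Render a wildcard block for the given platform, plus an optional sitemap."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in PRESETS[platform]]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


wordpress_txt = render_preset("wordpress", "https://example.com/sitemap.xml")
shopify_txt = render_preset("shopify")
print(wordpress_txt)
```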
Frequently Asked Questions
What does User-agent: * mean?
User-agent: * applies to all bots not explicitly named in another block. Specific user-agent rules take precedence over wildcard rules for that specific bot.
What does the Crawl-delay directive do?
The Crawl-delay directive tells bots to wait a specified number of seconds between requests to reduce server load. For example, Crawl-delay: 10 asks bots to wait 10 seconds between page requests. However, Googlebot does not respect this directive. Google previously offered a crawl rate setting in Search Console, but that tool has been retired, and Googlebot now adjusts its crawl rate automatically based on how your server responds. Bing and some other bots do respect Crawl-delay.
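Both answers can be illustrated with a short sketch using Python's standard-library parser. The file contents, the /search and /tmp/ paths, and the bot name OtherBot are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a wildcard block and a more specific block for Bingbot.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search

User-agent: Bingbot
Crawl-delay: 5
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named block wins for Bingbot; every other bot gets the wildcard values.
bing_delay = parser.crawl_delay("Bingbot")
other_delay = parser.crawl_delay("OtherBot")

# The wildcard Disallow does not carry over into the Bingbot block.
bing_search = parser.can_fetch("Bingbot", "https://example.com/search")
other_search = parser.can_fetch("OtherBot", "https://example.com/search")

print(bing_delay, other_delay, bing_search, other_search)
```

Note that a bot with its own block ignores the wildcard block entirely, so any wildcard rule that should also apply to that bot has to be repeated inside its block.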