robots.txt for AI Bots: How to Allow (or Block) GPTBot, ClaudeBot and More
Updated — 10 min read
Your robots.txt file is the single most powerful lever you have over which AI crawlers can access your website. In one short text file at your domain root, you decide whether bots like GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, and Google-Extended are allowed to read your pages. The rule is simple and direct: allowing these crawlers is required for your content to be discovered, summarized, and cited by AI answer engines, while blocking them opts your site out of AI training and citation. If you want to show up in ChatGPT, Claude, Perplexity, and Google's AI Overviews, you need to allow the right bots. If you want to keep your content out of AI systems, you block them. This guide is a definitive, practical reference to robots.txt for AI bots, complete with copy-paste examples for every scenario.
A stray `Disallow` is one of the fastest ways to become invisible to every major AI assistant. Check yours before anything else.
How robots.txt works (a quick refresher)
robots.txt is a plain-text file that lives at the root of your domain (for example, https://example.com/robots.txt). It follows the Robots Exclusion Protocol, a voluntary standard that well-behaved crawlers — including every major AI bot — read before fetching your pages. The file is made up of one or more groups, and each group targets a crawler and tells it what it may or may not access.
There are three directives you need to understand:
- `User-agent`: names the crawler the rules apply to. The value is the bot's user-agent token (for example, `GPTBot` or `ClaudeBot`). A wildcard `User-agent: *` applies to all crawlers that don't have their own specific group.
- `Disallow`: tells the named crawler not to access a path. `Disallow: /` blocks the entire site; `Disallow: /private/` blocks only that folder; an empty `Disallow:` blocks nothing.
- `Allow`: explicitly permits a path, typically used to carve an exception out of a broader `Disallow`. Most crawlers, including AI bots, support it.
Two precedence rules matter. First, a crawler obeys the most specific group that names it, not the wildcard group. If you have both a User-agent: * block and a User-agent: GPTBot block, GPTBot follows its own rules and ignores the wildcard entirely. Second, within a group, the most specific (longest) path rule wins when Allow and Disallow overlap. Getting these two rules wrong is the source of most accidental blocks.
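The two precedence rules can be sketched in Python. The first part uses the standard library's `urllib.robotparser` to show a named group replacing the wildcard group. The second part is a hand-written longest-match helper, because `urllib.robotparser` applies a group's rules in file order rather than by path length. The helper name `longest_match_allows`, the sample rules, and the example.com URLs are illustrative assumptions, and the helper ignores the `*` and `$` path wildcards that real crawlers may also support.

```python
from urllib.robotparser import RobotFileParser

# Rule 1: a crawler follows the group that names it, not the wildcard group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

gptbot_blog = parser.can_fetch("GPTBot", "https://example.com/blog/post")
other_blog = parser.can_fetch("SomeOtherBot", "https://example.com/blog/post")
gptbot_private = parser.can_fetch("GPTBot", "https://example.com/private/x")


# Rule 2: within a group, the longest matching path wins.
def longest_match_allows(rules, path):
    """Return True if `path` is allowed under plain-prefix rules.

    `rules` is a list of (directive, prefix) pairs, where directive is
    "allow" or "disallow". When two matches have equal length, "allow" wins.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty prefix matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical group: block /docs/ but carve out /docs/public/.
group_rules = [("disallow", "/docs/"), ("allow", "/docs/public/")]

print(gptbot_blog, other_blog, gptbot_private)
print(longest_match_allows(group_rules, "/docs/public/guide"))
print(longest_match_allows(group_rules, "/docs/internal/notes"))
```

In this sketch GPTBot may fetch `/blog/post` even though the wildcard group disallows the whole site, because GPTBot only reads its own group. The longer `/docs/public/` prefix beats the shorter `/docs/` prefix regardless of the order in which the two rules are written.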
The AI crawlers you should know
Modern AI companies run multiple, separately controllable bots — usually one for training, one for search/citation indexing, and one for live user-initiated fetches. This separation is the key to nuanced control: you can allow the bots that get you cited while blocking the ones that only feed model training. The table below lists the current, correct user-agent tokens you put in your robots.txt (use the short token, not the full HTTP user-agent string).
Major AI crawlers, their operators, and what blocking each one does (2026)
| User-agent token | Operator | What it's for | What blocking it does |
|---|---|---|---|
| `GPTBot` | OpenAI | Crawls content used to train OpenAI's foundation models. | Opts your content out of OpenAI model training. Does not by itself remove you from ChatGPT search. |
| `OAI-SearchBot` | OpenAI | Indexes pages so they can be surfaced and cited in ChatGPT search. | Removes you from ChatGPT search results and citations. |
| `ChatGPT-User` | OpenAI | Fetches a specific URL live when a user asks ChatGPT to read or browse it. | Stops ChatGPT from fetching your pages on a user's direct request. |
| `ClaudeBot` | Anthropic | Crawls content used to train Anthropic's Claude models. | Opts your content out of Claude model training. |
| `Claude-SearchBot` | Anthropic | Indexes content for Claude's search and citation features. | Removes you from Claude's search-based answers and citations. |
| `Claude-User` | Anthropic | Fetches pages live when a Claude user's query requires browsing. | Stops Claude from retrieving your pages on a user's request. |
| `anthropic-ai` | Anthropic | Legacy/deprecated training agent (still worth including for completeness). | Blocks the older Anthropic agent token. Largely superseded by ClaudeBot. |
| `PerplexityBot` | Perplexity | Indexes pages so Perplexity can surface and cite them in answers. | Removes you from Perplexity's cited search answers. |
| `Perplexity-User` | Perplexity | Fetches a URL live when a Perplexity user's query needs it. | Stops live, user-initiated fetches by Perplexity. |
| `Google-Extended` | Google | Controls use of your content for Gemini and Google's generative AI training/grounding. NOT a crawler that fetches pages. | Opts you out of Gemini/Vertex generative training. Has NO effect on Google Search crawling or ranking. |
| `Googlebot` | Google | The main Google Search crawler (also feeds AI Overviews from the search index). | Removes you from Google Search entirely; almost never what you want. |
| `Amazonbot` | Amazon | Crawls content for Amazon products including Alexa-related answers. | Opts your content out of Amazon's AI use. |
| `Applebot-Extended` | Apple | Controls use of Applebot-crawled content for Apple's generative AI training. | Opts you out of Apple AI training without affecting Siri/Spotlight search indexing. |
| `CCBot` | Common Crawl | Crawls the open web for the Common Crawl dataset, widely used to train many AI models. | Reduces inclusion of your content in a dataset many AI labs train on. |
| `Bytespider` | ByteDance | Crawls content for ByteDance/TikTok AI products. | Opts your content out of ByteDance's AI use. |
| `Meta-ExternalAgent` | Meta | Crawls content used for Meta AI training and products. | Opts your content out of Meta AI training. |
How to ALLOW AI bots (recommended for visibility)
If your goal is AI visibility — being read, summarized, and cited by ChatGPT, Claude, Perplexity, Gemini, and others — the safest configuration is to allow everything. You do not actually need to name each AI bot to allow it; the default behavior of robots.txt is to permit any crawler that isn't disallowed. The single most important thing is to make sure you are not blocking them, intentionally or by accident.
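To make the default-allow behavior concrete, here is a minimal sketch that asks Python's standard `urllib.robotparser` which AI tokens may fetch the site root under a given robots.txt. The function name `blocked_tokens`, the shortened token list, and the two sample files are assumptions for illustration. `urllib.robotparser` approximates crawler behavior (it matches user-agent names by substring, for instance), so treat the result as a first check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# A subset of the tokens from the table above.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
]


def blocked_tokens(robots_txt, tokens=AI_TOKENS, url="https://example.com/"):
    """Return the tokens that may NOT fetch `url` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, url)]


# No bot is named, yet none is blocked: an empty Disallow permits everything.
PERMISSIVE = """\
User-agent: *
Disallow:
"""

# A leftover blanket block: every token falls back to the wildcard group.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

print(blocked_tokens(PERMISSIVE))
print(blocked_tokens(BLANKET_BLOCK))
```

The first call reports no blocked tokens even though the file never mentions an AI bot; the second reports all of them, which is the accidental-block case worth ruling out first.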
A minimal, fully permissive robots.txt that welcomes all AI bots looks like this:
# Allow all crawlers, including AI bots, to access everything
User-agent: *
Disallow:
# Point crawlers to your sitemap
Sitemap: https://example.com/sitemap.xml
If you prefer to be explicit (which makes your intent clear to anyone auditing the file and protects you if you later add a restrictive wildcard rule), you can name the major AI bots and grant them full access:
# Explicitly allow major AI crawlers full access
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: CCBot
Allow: /
# Everyone else: full access too
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Run your URL through a free checker like checkgeoscore.com to see how an AI engine perceives your page.
How to BLOCK AI bots (if you don't want AI training or citation)
Some publishers (news organizations, premium content businesses, membership sites) deliberately want to keep their work out of AI systems. To block AI crawlers, you give each bot its own group with a `Disallow: /`. robots.txt has no way to address AI bots as a class, so you must list every bot you want to exclude; a blanket `User-agent: *` group with `Disallow: /` would block Googlebot and your entire search presence too, which is almost never what you want.
Here is a thorough block list covering the major AI crawlers while leaving normal search engines untouched:
# Block AI training, search, and user-fetch bots
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Note: this does NOT block Googlebot or Bingbot; normal search stays intact
Understand the trade-off before you commit. Blocking these bots removes you from AI citations. When someone asks ChatGPT, Claude, or Perplexity a question your content could have answered, your site will not be among the sources. As AI assistants capture a larger share of how people find information, this is a real opportunity cost. Blocking makes sense when your content's value depends on people visiting your site directly or paying for access; it rarely makes sense for businesses that want to be discovered.
Allow some, block others (the nuanced middle ground)
Most sophisticated sites don't want all-or-nothing. The common, defensible policy is: allow the search and citation bots (so you still appear in AI answers with a link back to your site), but block the pure training bots (so your content isn't absorbed into foundation models). OpenAI and Anthropic explicitly support this split because their crawlers are separately addressable.
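One way to keep a split policy like this maintainable is to generate the file from two token lists. The sketch below is a hypothetical helper (`render_split_policy` is not part of any library) that uses tokens named in this guide. Its last lines re-parse the output with Python's standard `urllib.robotparser` as a rough sanity check, which approximates but does not exactly replicate how each crawler matches rules.

```python
from urllib.robotparser import RobotFileParser

TRAINING_TOKENS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
    "Applebot-Extended", "CCBot", "Bytespider", "Meta-ExternalAgent",
]
SEARCH_TOKENS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "Perplexity-User",
]


def render_split_policy(block, allow, sitemap_url=None):
    """Build robots.txt text: one group per token, then a permissive wildcard."""
    lines = []
    for token in block:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    for token in allow:
        lines += [f"User-agent: {token}", "Allow: /", ""]
    # Everyone else (including normal search crawlers) keeps full access.
    lines += ["User-agent: *", "Disallow:", ""]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip() + "\n"


robots_txt = render_split_policy(
    TRAINING_TOKENS, SEARCH_TOKENS, "https://example.com/sitemap.xml"
)

# Sanity check: training tokens blocked, search tokens and Googlebot allowed.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for token in ("GPTBot", "OAI-SearchBot", "Googlebot"):
    print(token, parser.can_fetch(token, "https://example.com/"))
```

Generating the file keeps the two lists as the single place to edit when an operator adds or renames a token, and the re-parse step catches a token that accidentally ends up in the wrong list.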
This configuration blocks training while keeping you eligible for AI search citations:
# --- BLOCK training-only crawlers ---
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# --- ALLOW search & citation crawlers (keeps you cited) ---
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
You can also apply path-level nuance per bot. For example, allow AI bots to crawl your public blog and documentation but keep them out of gated, account, or checkout areas:
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /account/
Disallow: /checkout/
Disallow: /members/
Google-Extended vs Googlebot: a critical distinction
This is the most misunderstood part of AI robots.txt, and getting it wrong can either tank your SEO or fail to opt you out at all. `Googlebot` and `Google-Extended` are completely different tokens with completely different effects.
- `Googlebot` is Google's standard web crawler. It builds the index that powers Google Search, and that same index feeds AI Overviews. Blocking `Googlebot` removes you from Google Search. Do not block it unless you genuinely want to disappear from Google.
- `Google-Extended` is not a crawler at all; it's a permission token. It controls whether content Google has already crawled may be used to train and ground generative AI products like Gemini and Vertex AI. It fetches nothing on its own.
The crucial takeaway: blocking `Google-Extended` does NOT hurt your Google Search ranking. Google has been explicit that Google-Extended is independent of search indexing. You can Disallow: / for Google-Extended to opt out of Gemini training while keeping Googlebot fully allowed and your organic rankings completely intact:
# Opt out of Gemini/generative AI training...
User-agent: Google-Extended
Disallow: /
# ...while keeping Google Search ranking fully intact
User-agent: Googlebot
Allow: /
Where robots.txt lives and how to test it
Your robots.txt must sit at the root of each host, served as text/plain over HTTP 200. The canonical location is https://yourdomain.com/robots.txt. Crawlers do not look for it in subfolders, and a file at https://yourdomain.com/blog/robots.txt is ignored. Subdomains need their own file: blog.example.com and www.example.com are separate hosts with separate robots.txt files.
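Because the file is scoped to the host, the robots.txt that governs any given page can be derived from that page's URL. A minimal sketch using the standard library's `urllib.parse` follows; the function name `robots_url` and the example URLs are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/blog/post?ref=home"))
print(robots_url("https://blog.example.com/2024/launch"))
```

The two calls yield different files, one per host, which is why a subdomain with a stale robots.txt can stay blocked while the main site is open.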
The fastest way to confirm what bots actually see is to fetch the live file yourself:
# Fetch the live robots.txt exactly as a crawler would
curl -A "GPTBot" -i https://example.com/robots.txt
# Confirm it returns HTTP 200 and text/plain
curl -sI https://example.com/robots.txt
Watch for these gotchas when testing:
- CDN, firewall, and WAF overrides. Cloudflare, Akamai, Fastly, and others can serve their own `robots.txt`, inject bot-management rules, or block AI user-agents entirely, even when your origin file says allow. Many platforms now have a one-click "block AI bots" toggle that silently overrides your file. If your config looks right but bots are still blocked, check your CDN/WAF settings.
- Caching. Crawlers cache robots.txt (often up to 24 hours). After editing, expect a delay before changes take effect.
- Redirects and 4xx/5xx responses. If `/robots.txt` redirects oddly or returns a 5xx error, some crawlers treat the whole site as disallowed; a 404 is generally treated as fully allowed.
- Token typos. The user-agent token is matched case-insensitively, so casing is not the problem, but a misspelled token like `GPT-Bot` or `Claude-bot-search` simply matches nothing and silently does nothing.
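The status-code gotcha above can be summarized as a small decision function. This is a sketch of the general behavior described here (success means parse, a 4xx such as 404 means no restrictions, a 5xx means assume everything is disallowed); the function name `decision_for_status` and the returned labels are hypothetical, and individual crawlers may deviate from this mapping.

```python
def decision_for_status(status_code):
    """Map the HTTP status of a /robots.txt fetch to a crawl decision label."""
    if 200 <= status_code < 300:
        return "parse-rules"       # file served normally: read its groups
    if 300 <= status_code < 400:
        return "follow-redirect"   # crawler follows, then re-evaluates
    if 400 <= status_code < 500:
        return "allow-all"         # e.g. 404: generally treated as no rules
    if 500 <= status_code < 600:
        return "disallow-all"      # server error: site treated as off-limits
    raise ValueError(f"unexpected HTTP status: {status_code}")


for code in (200, 301, 404, 503):
    print(code, decision_for_status(code))
```

The asymmetry is the point: a missing file is harmless, while a robots.txt that intermittently returns 5xx can take the whole site out of crawlers' reach until it recovers.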
Common robots.txt mistakes that make you invisible to AI
- A blanket `User-agent: *` with `Disallow: /` left over from a staging site. This blocks every compliant bot, AI and search alike. It's the number-one cause of accidental invisibility.
- Blocking your CSS, JS, or `/api/` routes that render content. If AI bots can't fetch the resources needed to read your page, they see an empty shell.
- Assuming a named group inherits the wildcard. It doesn't. If you add `User-agent: GPTBot` with only an `Allow: /blog/`, GPTBot ignores every wildcard rule, including any global `Disallow` you thought applied.
- Blocking `Googlebot` when you meant `Google-Extended`. This removes you from Google Search instead of just opting out of Gemini training.
- Using the full HTTP user-agent string (the long `Mozilla/5.0 ... GPTBot/1.1` line) as the `User-agent` value. Use the short token only.
- Relying on robots.txt to hide private data. robots.txt is public and only requests compliance. Sensitive pages need real authentication, not a `Disallow`.
- Forgetting subdomains. Your main site allows AI bots, but `blog.example.com` has an old restrictive file you forgot about.
- A CDN-level AI block overriding a perfectly correct origin file, and nobody checking the CDN dashboard.
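Several of these mistakes can be caught mechanically. The following is a minimal lint sketch, not a full robots.txt parser: it flags a wildcard group that disallows the whole site, a full user-agent string used as a token, and a token that looks like a misspelling of a known one. The function name `lint_robots`, the 0.8 similarity cutoff, and the `ExampleBot` user-agent string in the sample are all assumptions chosen for illustration.

```python
import difflib

# Tokens named in this guide; extend the list for your own needs.
KNOWN_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-SearchBot",
    "Claude-User", "anthropic-ai", "PerplexityBot", "Perplexity-User",
    "Google-Extended", "Googlebot", "Amazonbot", "Applebot-Extended", "CCBot",
    "Bytespider", "Meta-ExternalAgent",
]
_BY_LOWER = {token.lower(): token for token in KNOWN_TOKENS}


def lint_robots(robots_txt):
    """Return a list of warning strings for a robots.txt body."""
    warnings = []
    agents = []        # user-agent values of the group being read
    seen_rule = False  # True once the current group has an Allow/Disallow
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a user-agent line after rules starts a new group
                agents, seen_rule = [], False
            agents.append(value)
            if "/" in value or " " in value:
                warnings.append(f"full user-agent string used as token: {value!r}")
            elif value != "*" and value.lower() not in _BY_LOWER:
                close = difflib.get_close_matches(
                    value.lower(), list(_BY_LOWER), n=1, cutoff=0.8
                )
                if close:
                    suggestion = _BY_LOWER[close[0]]
                    warnings.append(f"possible typo: {value!r} (did you mean {suggestion!r}?)")
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value == "/" and "*" in agents:
                warnings.append("wildcard group disallows the whole site")
    return warnings


# Hypothetical file containing three of the mistakes listed above.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPT-Bot
Disallow: /private/

User-agent: Mozilla/5.0 (compatible; ExampleBot/1.0)
Disallow: /
"""

for warning in lint_robots(SAMPLE):
    print(warning)
```

A check like this only inspects the origin file. It cannot see a CDN-level block or a forgotten subdomain, so those still need a manual look.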
robots.txt vs llms.txt
These two files are complementary, not competing. `robots.txt` governs permission — which crawlers may access which paths. `llms.txt` is an emerging convention that provides a content map for AI engines: a curated, Markdown-formatted index at /llms.txt pointing to your most important, AI-friendly pages so models can find and understand your best content quickly. robots.txt decides whether the door is open; llms.txt is the directory just inside it. A complete GEO setup uses both: allow the right bots in robots.txt, then guide them with llms.txt and clean, structured content.
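As a rough illustration of the content-map idea, the sketch below renders a small Markdown index of the kind llms.txt describes: a title, a one-line summary, and sections of links. The layout follows the general shape of the llms.txt proposal as commonly described, but because the convention is still emerging, the exact structure, the function name `render_llms_txt`, and every page listed are assumptions.

```python
def render_llms_txt(title, summary, sections):
    """Render a Markdown index: H1 title, blockquote summary, link sections.

    `sections` maps a section heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site used only to show the output shape.
llms_txt = render_llms_txt(
    "Example Co",
    "Example Co sells widgets and publishes guides on using them.",
    {
        "Docs": [
            ("Getting started", "https://example.com/docs/start", "Setup in five steps"),
        ],
        "Company": [
            ("Pricing", "https://example.com/pricing", "Plans and limits"),
        ],
    },
)
print(llms_txt)
```

The output would be served at `/llms.txt`, next to a robots.txt that allows the bots you want reading it.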
Frequently asked questions
Should I allow or block GPTBot?
Allow GPTBot if you want your content to be eligible for use by OpenAI and you value being part of the AI ecosystem; this is the right call for most businesses seeking visibility. Block GPTBot via robots.txt if you specifically want to opt your content out of OpenAI model training. A popular middle path is to block GPTBot (training) while allowing OAI-SearchBot (citation), so you stay out of training but remain citable in ChatGPT search.
How do I block AI crawlers?
Add a separate group for each AI bot in your robots.txt with `Disallow: /`. For example: `User-agent: GPTBot` on one line, `Disallow: /` on the next, repeated for ClaudeBot, PerplexityBot, Google-Extended, CCBot, and the others. There is no single token that covers all AI bots, so you must list each one individually. Do not use a blanket `User-agent: *` with `Disallow: /`, as that would also block Google Search.
How do I block GPTBot in robots.txt specifically?
Add exactly two lines: User-agent: GPTBot followed by Disallow: /. Use the short token GPTBot, not the full HTTP user-agent string. To also block OpenAI's search and live-fetch bots, add separate groups for OAI-SearchBot and ChatGPT-User.
Does blocking Google-Extended hurt SEO?
No. Google-Extended only controls whether your content is used for generative AI products like Gemini. It has zero effect on Googlebot, your Google Search indexing, or your organic rankings. You can safely block Google-Extended and keep full Google Search visibility.
Which AI bots should I allow?
For maximum visibility, allow the search and citation bots: OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot and Claude-User (Anthropic), and PerplexityBot. Allowing the training bots (GPTBot, ClaudeBot, CCBot, Google-Extended) is optional and depends on whether you're comfortable with your content being used for model training.
Will robots.txt stop AI from using my content?
It stops compliant, named crawlers from fetching your pages, and the major AI companies (OpenAI, Anthropic, Google, Perplexity) state that their crawlers honor it. It does not technically prevent a non-compliant scraper from reading public pages, nor does it remove content already absorbed into a trained model. For enforcement against bad actors, use server-side or WAF/firewall blocking.
How do I allow ChatGPT to crawl my site?
Make sure your robots.txt does not disallow OpenAI's bots. To be explicit, allow OAI-SearchBot (ChatGPT search citations) and ChatGPT-User (live fetches when a user asks ChatGPT to read your page), and optionally GPTBot (training). Also confirm your CDN or WAF isn't blocking these user-agents independently of your robots.txt.
What's the difference between GPTBot, OAI-SearchBot, and ChatGPT-User?
GPTBot crawls content for training OpenAI's models. OAI-SearchBot indexes pages so they can be cited in ChatGPT search. ChatGPT-User fetches a specific URL live when a user asks ChatGPT to read or browse it. They're separately controllable, so you can allow citation while blocking training.
Do AI crawlers respect crawl-delay?
Support varies. crawl-delay is a non-standard directive that some bots honor and others ignore. It can slow how aggressively a crawler hits your server, but it does not control access. For access control, use Allow/Disallow; for load problems with a specific bot, check that operator's documentation for whether it supports crawl-delay.
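For reference, this is how a `Crawl-delay` line sits inside a group and how Python's standard `urllib.robotparser` reads it back. The bot name `ExampleBot` and the 10-second value are hypothetical, and a parser exposing the value says nothing about whether a given crawler honors it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay_named = parser.crawl_delay("ExampleBot")  # delay set in its own group
delay_other = parser.crawl_delay("GPTBot")      # no delay in the wildcard group
still_blocked = not parser.can_fetch("ExampleBot", "https://example.com/private/x")

print(delay_named, delay_other, still_blocked)
```

The delay and the access rules are independent: `ExampleBot` is asked to slow down everywhere, but only the `Disallow` line keeps it out of `/private/`.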
How do I check whether AI bots can see my site?
Fetch your file with curl -A "GPTBot" https://yourdomain.com/robots.txt and read the rules, confirm it returns HTTP 200 as text/plain, and check your CDN/WAF for any AI-blocking toggle. Then run your URL through a free GEO checker such as checkgeoscore.com to see how an AI engine perceives your page and whether anything is blocking it.
Your robots.txt is the gatekeeper for the entire AI era of search. Decide deliberately: allow the AI bots that get you cited, block the ones whose use you object to, and never let a stray Disallow or a forgotten CDN toggle make you invisible to the assistants your audience is already asking. Once your permissions are right, the rest of GEO — semantic structure, structured data, and a clear content map — is what turns access into citations.