robots.txt for AI Bots: How to Allow (or Block) GPTBot, ClaudeBot and More

Your robots.txt file is the most direct control you have over which AI crawlers can access your website. In one short text file at your domain root, you decide whether bots like GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot may read your pages, and whether tokens like Google-Extended permit Google to use your content for generative AI. The rule is simple: AI answer engines can only discover, summarize, and cite content their crawlers are allowed to fetch, and blocking those crawlers opts your site out of AI training and citation. If you want to show up in ChatGPT, Claude, Perplexity, and Google's AI Overviews, you need to allow the right bots. If you want to keep your content out of AI systems, you block them. This guide is a practical reference to robots.txt for AI bots, with copy-paste examples for each scenario.

Bottom line: robots.txt does not push your content into AI systems — but it can lock them out. For Generative Engine Optimization (GEO), an accidental Disallow is one of the fastest ways to become invisible to every major AI assistant. Check yours before anything else.

How robots.txt works (a quick refresher)

robots.txt is a plain-text file that lives at the root of your domain (for example, https://example.com/robots.txt). It follows the Robots Exclusion Protocol, a voluntary standard that well-behaved crawlers — including every major AI bot — read before fetching your pages. The file is made up of one or more groups, and each group targets a crawler and tells it what it may or may not access.

There are three directives you need to understand:

  • `User-agent` — names the crawler the rules apply to. The value is the bot's user-agent token (for example, GPTBot or ClaudeBot). A wildcard User-agent: * applies to all crawlers that don't have their own specific group.
  • `Disallow` — tells the named crawler not to access a path. Disallow: / blocks the entire site; Disallow: /private/ blocks only that folder; an empty Disallow: blocks nothing.
  • `Allow` — explicitly permits a path, typically used to carve an exception out of a broader Disallow. Most crawlers, including AI bots, support it.

Two precedence rules matter. First, a crawler obeys the most specific group that names it, not the wildcard group. If you have both a User-agent: * block and a User-agent: GPTBot block, GPTBot follows its own rules and ignores the wildcard entirely. Second, within a group, the most specific (longest) path rule wins when Allow and Disallow overlap. Getting these two rules wrong is the source of most accidental blocks.
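
The first precedence rule (a named group replaces the wildcard group rather than adding to it) can be checked with Python's standard-library `urllib.robotparser`. This is a minimal sketch, not a model of any specific crawler: the paths `/private/` and `/drafts/` and the bot name `SomeOtherBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a wildcard group plus a GPTBot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /drafts/
"""

def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# GPTBot has its own group, so the wildcard's /private/ rule does not apply to it.
gptbot_private = can_fetch(ROBOTS_TXT, "GPTBot", "/private/report.html")
gptbot_drafts = can_fetch(ROBOTS_TXT, "GPTBot", "/drafts/post.html")

# A bot without its own group falls back to the wildcard group.
other_private = can_fetch(ROBOTS_TXT, "SomeOtherBot", "/private/report.html")
other_drafts = can_fetch(ROBOTS_TXT, "SomeOtherBot", "/drafts/post.html")

print(gptbot_private, gptbot_drafts, other_private, other_drafts)
# prints: True False False True
```

Note that GPTBot can read `/private/` here even though the wildcard group disallows it, which is exactly the kind of surprise the first rule produces.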

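Important: robots.txt is permission, not enforcement. It tells compliant bots what they may crawl, but it does not technically prevent a non-compliant scraper from reading public pages. The operators of the major named AI crawlers below document that their crawlers honor robots.txt, though some treat live, user-initiated fetchers differently from automated crawlers, so check each operator's current documentation. To hard-block bad actors, you need server-side or firewall/WAF rules.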

The AI crawlers you should know

Modern AI companies run multiple, separately controllable bots — usually one for training, one for search/citation indexing, and one for live user-initiated fetches. This separation is the key to nuanced control: you can allow the bots that get you cited while blocking the ones that only feed model training. The table below lists the current, correct user-agent tokens you put in your robots.txt (use the short token, not the full HTTP user-agent string).

Major AI crawlers, their operators, and what blocking each one does (2026)

| User-agent token | Operator | What it's for | What blocking it does |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Crawls content used to train OpenAI's foundation models. | Opts your content out of OpenAI model training. Does not by itself remove you from ChatGPT search. |
| OAI-SearchBot | OpenAI | Indexes pages so they can be surfaced and cited in ChatGPT search. | Removes you from ChatGPT search results and citations. |
| ChatGPT-User | OpenAI | Fetches a specific URL live when a user asks ChatGPT to read or browse it. | Stops ChatGPT from fetching your pages on a user's direct request. |
| ClaudeBot | Anthropic | Crawls content used to train Anthropic's Claude models. | Opts your content out of Claude model training. |
| Claude-SearchBot | Anthropic | Indexes content for Claude's search and citation features. | Removes you from Claude's search-based answers and citations. |
| Claude-User | Anthropic | Fetches pages live when a Claude user's query requires browsing. | Stops Claude from retrieving your pages on a user's request. |
| anthropic-ai | Anthropic | Legacy/deprecated training agent (still worth including for completeness). | Blocks the older Anthropic agent token. Largely superseded by ClaudeBot. |
| PerplexityBot | Perplexity | Indexes pages so Perplexity can surface and cite them in answers. | Removes you from Perplexity's cited search answers. |
| Perplexity-User | Perplexity | Fetches a URL live when a Perplexity user's query needs it. | Stops live, user-initiated fetches by Perplexity. |
| Google-Extended | Google | Controls use of your content for Gemini and Google's generative AI training/grounding. NOT a crawler that fetches pages. | Opts you out of Gemini/Vertex generative training. Has NO effect on Google Search crawling or ranking. |
| Googlebot | Google | The main Google Search crawler (also feeds AI Overviews from the search index). | Removes you from Google Search entirely; almost never what you want. |
| Amazonbot | Amazon | Crawls content for Amazon products including Alexa-related answers. | Opts your content out of Amazon's AI use. |
| Applebot-Extended | Apple | Controls use of Applebot-crawled content for Apple's generative AI training. | Opts you out of Apple AI training without affecting Siri/Spotlight search indexing. |
| CCBot | Common Crawl | Crawls the open web for the Common Crawl dataset, widely used to train many AI models. | Reduces inclusion of your content in a dataset many AI labs train on. |
| Bytespider | ByteDance | Crawls content for ByteDance/TikTok AI products. | Opts your content out of ByteDance's AI use. |
| Meta-ExternalAgent | Meta | Crawls content used for Meta AI training and products. | Opts your content out of Meta AI training. |

Note the pattern: for OpenAI, Anthropic, and Perplexity there is a training bot, a search/citation bot, and a live-user bot. Blocking the training bot but allowing the search/citation bot is how you stay cited while opting out of training, and it is the most common GEO-friendly compromise.

How to ALLOW AI bots (if you want AI visibility)

If your goal is AI visibility — being read, summarized, and cited by ChatGPT, Claude, Perplexity, Gemini, and others — the safest configuration is to allow everything. You do not actually need to name each AI bot to allow it; the default behavior of robots.txt is to permit any crawler that isn't disallowed. The single most important thing is to make sure you are not blocking them, intentionally or by accident.

A minimal, fully permissive robots.txt that welcomes all AI bots looks like this:

# Allow all crawlers, including AI bots, to access everything
User-agent: *
Disallow:

# Point crawlers to your sitemap
Sitemap: https://example.com/sitemap.xml
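
One way to confirm that a file like this leaves no AI bot blocked is to parse it and ask, token by token, whether the root path is fetchable. The sketch below uses Python's standard-library parser as a stand-in for a real crawler; the helper name `blocked_bots` and the `STAGING_LEFTOVER` sample are illustrative assumptions, and the token list is a subset of the table in this guide.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the crawler table in this guide (subset).
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
]

def blocked_bots(robots_txt, tokens=AI_BOT_TOKENS, path="/"):
    """Return the tokens that may NOT fetch `path` under this robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, path)]

# The fully permissive file: nothing is disallowed.
PERMISSIVE = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
"""

# A leftover staging file: the wildcard group blocks the whole site.
STAGING_LEFTOVER = """\
User-agent: *
Disallow: /
"""

print(blocked_bots(PERMISSIVE))        # prints: []
print(blocked_bots(STAGING_LEFTOVER))  # prints every token in AI_BOT_TOKENS
```

An empty list means none of the listed tokens is blocked by the file itself; it says nothing about CDN or WAF rules, which act outside robots.txt.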

If you prefer to be explicit (which makes your intent clear to anyone auditing the file and protects you if you later add a restrictive wildcard rule), you can name the major AI bots and grant them full access:

# Explicitly allow major AI crawlers full access
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: CCBot
Allow: /

# Everyone else: full access too
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Allowing AI bots is necessary but not sufficient for GEO. Once they can crawl you, you still need clean semantic HTML, clear answer-first content, structured data, and an llms.txt map for AI engines to actually cite you. Run your URL through a free checker like checkgeoscore.com to see how an AI engine perceives your page.

How to BLOCK AI bots (if you don't want AI training or citation)

Some publishers, such as news organizations, premium content businesses, and membership sites, deliberately want to keep their work out of AI systems. To block AI crawlers, you give each bot its own group with a Disallow: /. No single token targets only AI crawlers, so you must list every bot you want to exclude. A blanket User-agent: * with Disallow: / would also block Googlebot and every other search crawler that lacks its own group, taking your entire search presence with it, which is almost never what you want.

Here is a thorough block list covering the major AI crawlers while leaving normal search engines untouched:

# Block AI training, search, and user-fetch bots
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Note: this does NOT block Googlebot or Bingbot — normal search stays intact

Understand the trade-off before you commit. Blocking these bots removes you from AI citations. When someone asks ChatGPT, Claude, or Perplexity a question your content could have answered, your site will not be among the sources. As AI assistants capture a larger share of how people find information, this is a real opportunity cost. Blocking makes sense when your content's value depends on people visiting your site directly or paying for access; it rarely makes sense for businesses that want to be discovered.

Allow some, block others (the nuanced middle ground)

Most sophisticated sites don't want all-or-nothing. The common, defensible policy is: allow the search and citation bots (so you still appear in AI answers with a link back to your site), but block the pure training bots (so your content isn't absorbed into foundation models). OpenAI and Anthropic explicitly support this split because their crawlers are separately addressable.

This configuration blocks training while keeping you eligible for AI search citations:

# --- BLOCK training-only crawlers ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# --- ALLOW search & citation crawlers (keeps you cited) ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
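
Because the split policy is just one group per token, it can be generated from two lists and then checked before deployment. The following is a sketch under stated assumptions: `render_policy` is a hypothetical helper, the token lists mirror the configuration above, and Python's standard-library parser is used only as an approximate check of how the groups resolve.

```python
from urllib.robotparser import RobotFileParser

def render_policy(block, allow, sitemap=None):
    """Render one group per token: Disallow: / for `block`, Allow: / for `allow`."""
    lines = []
    for token in block:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    for token in allow:
        lines += [f"User-agent: {token}", "Allow: /", ""]
    # Every other crawler keeps full access.
    lines += ["User-agent: *", "Disallow:", ""]
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines).rstrip() + "\n"

TRAINING = ["GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
            "Applebot-Extended", "CCBot", "Bytespider", "Meta-ExternalAgent"]
SEARCH = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
          "Claude-User", "PerplexityBot", "Perplexity-User"]

policy = render_policy(TRAINING, SEARCH, "https://example.com/sitemap.xml")

# Sanity-check the rendered text before publishing it.
parser = RobotFileParser()
parser.parse(policy.splitlines())
training_blocked = all(not parser.can_fetch(t, "/") for t in TRAINING)
search_allowed = all(parser.can_fetch(t, "/") for t in SEARCH)
googlebot_allowed = parser.can_fetch("Googlebot", "/")

print(training_blocked, search_allowed, googlebot_allowed)
# prints: True True True
```

Keeping the two lists in one place makes it harder to block a search bot by accident when a new token is added later.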

You can also apply path-level nuance per bot. For example, allow AI bots to crawl your public blog and documentation but keep them out of gated, account, or checkout areas:

User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /account/
Disallow: /checkout/
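Disallow: /members/

Path rules match by prefix and are case-sensitive, and when Allow and Disallow overlap, the longest matching rule wins. The Allow lines above are there for clarity only: any path not matched by a Disallow is already allowed. When you mix Allow and Disallow for one bot, test it (see below), because it's easy to write a rule that does the opposite of what you intended.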
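
The longest-rule-wins behavior can be sketched in a few lines. This is a simplified illustration, not a full parser: it handles plain path prefixes only (no `*` or `$` patterns, no percent-encoding normalization), it resolves an Allow/Disallow tie in favor of Allow, and the rule set is a made-up example. Python's own `urllib.robotparser` is not used here because it applies the first matching rule in file order rather than the longest one.

```python
from typing import Iterable, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), e.g. ("Disallow", "/docs/")

def is_allowed(rules: Iterable[Rule], path: str) -> bool:
    """Evaluate one group's rules for `path` using longest-prefix-wins."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = directive.lower() == "allow"
        # A longer prefix always wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

# Hypothetical group: block /docs/ but carve out /docs/public/.
EXAMPLE_RULES = [
    ("Disallow", "/docs/"),
    ("Allow", "/docs/public/"),
    ("Disallow", "/account/"),
]

print(is_allowed(EXAMPLE_RULES, "/docs/public/intro.html"))  # prints: True
print(is_allowed(EXAMPLE_RULES, "/docs/internal.html"))      # prints: False
print(is_allowed(EXAMPLE_RULES, "/blog/post.html"))          # prints: True
```

Under this scheme the order of the lines inside a group does not matter, which is why the carve-out works even though the broader Disallow comes first.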

Google-Extended vs Googlebot — a critical distinction

This is the most misunderstood part of AI robots.txt, and getting it wrong can either tank your SEO or fail to opt you out at all. `Googlebot` and `Google-Extended` are completely different tokens with completely different effects.

  • `Googlebot` is Google's standard web crawler. It builds the index that powers Google Search — and that same index feeds AI Overviews. Blocking Googlebot removes you from Google Search. Do not block it unless you genuinely want to disappear from Google.
  • `Google-Extended` is not a crawler at all — it's a permission token. It controls whether content Google has already crawled may be used to train and ground generative AI products like Gemini and Vertex AI. It fetches nothing on its own.

The crucial takeaway: blocking `Google-Extended` does NOT hurt your Google Search ranking. Google has been explicit that Google-Extended is independent of search indexing. You can Disallow: / for Google-Extended to opt out of Gemini training while keeping Googlebot fully allowed and your organic rankings completely intact:

# Opt out of Gemini/generative AI training...
User-agent: Google-Extended
Disallow: /

# ...while keeping Google Search ranking fully intact
User-agent: Googlebot
Allow: /

Symmetry check: blocking Google-Extended opts you out of Gemini training but does NOT remove you from Google's AI Overviews, because those are generated from the regular search index that Googlebot builds. To leave AI Overviews you'd have to leave Google Search, which is why most sites simply allow both.

Where robots.txt lives and how to test it

Your robots.txt must sit at the root of each host, served as text/plain over HTTP 200. The canonical location is https://yourdomain.com/robots.txt. Crawlers do not look for it in subfolders, and a file at https://yourdomain.com/blog/robots.txt is ignored. Subdomains need their own file: blog.example.com and www.example.com are separate hosts with separate robots.txt files.
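
The one-file-per-host rule means the robots.txt that governs a page is determined by the page's scheme and host alone. A small sketch, assuming a hypothetical helper named `robots_url`:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs `page_url` (same scheme and host)."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Drop the path, query, and fragment; only the host root matters.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/post.html"))
# prints: https://www.example.com/robots.txt
print(robots_url("https://blog.example.com/2026/launch/?utm=x"))
# prints: https://blog.example.com/robots.txt
```

Running this over a list of your key URLs is a quick way to enumerate every robots.txt file you actually need to audit.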

The fastest way to confirm what bots actually see is to fetch the live file yourself:

# Fetch the live robots.txt exactly as a crawler would
curl -A "GPTBot" -i https://example.com/robots.txt

# Confirm it returns HTTP 200 and text/plain
curl -sI https://example.com/robots.txt

Watch for these gotchas when testing:

  • CDN, firewall, and WAF overrides. Cloudflare, Akamai, Fastly, and others can serve their own robots.txt, inject bot-management rules, or block AI user-agents entirely — even when your origin file says allow. Many platforms now have a one-click "block AI bots" toggle that silently overrides your file. If your config looks right but bots are still blocked, check your CDN/WAF settings.
  • Caching. Crawlers cache robots.txt (often up to 24 hours). After editing, expect a delay before changes take effect.
  • Redirects and 4xx/5xx responses. If /robots.txt redirects oddly or returns a 5xx error, some crawlers treat the whole site as disallowed; a 404 is generally treated as fully allowed.
  • Token typos and path casing. User-agent tokens are matched case-insensitively, so gptbot and GPTBot are equivalent, but paths are case-sensitive (/Blog/ does not match /blog/). A misspelled token like GPT-Bot or Claude-bot-search matches nothing and silently has no effect.
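
The status-code gotcha can be summarized as a small decision function. This sketch follows the convention described in the bullet above (and in the Robots Exclusion Protocol standard): success means parse the file, a 4xx means no restrictions, and a 5xx means assume everything is disallowed. Individual crawlers may deviate from it, and the names `RobotsPolicy` and `policy_for_status` are hypothetical.

```python
from enum import Enum

class RobotsPolicy(Enum):
    PARSE_FILE = "parse the returned rules"
    ALLOW_ALL = "treat the site as fully allowed"
    DISALLOW_ALL = "treat the site as fully disallowed"

def policy_for_status(status: int) -> RobotsPolicy:
    """Map the final HTTP status of /robots.txt (after redirects) to a policy."""
    if 200 <= status < 300:
        return RobotsPolicy.PARSE_FILE
    if 400 <= status < 500:
        return RobotsPolicy.ALLOW_ALL      # e.g. 404: no file, no restrictions
    if 500 <= status < 600:
        return RobotsPolicy.DISALLOW_ALL   # server error: assume blocked
    # 1xx/3xx are not final answers; follow redirects before deciding.
    raise ValueError(f"unexpected final status for robots.txt: {status}")

for code in (200, 404, 503):
    print(code, "->", policy_for_status(code).value)
```

The practical consequence is that a flaky origin returning 5xx for /robots.txt can be worse than having no file at all.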

Common robots.txt mistakes that make you invisible to AI

  • A blanket `User-agent: *` with `Disallow: /` left over from a staging site. This blocks every compliant bot, AI and search alike. It's the number-one cause of accidental invisibility.
  • Blocking your CSS, JS, or `/api/` routes that render content. If AI bots can't fetch the resources needed to read your page, they see an empty shell.
  • Assuming a named group inherits the wildcard. It doesn't. If you add User-agent: GPTBot with only an Allow: /blog/, GPTBot ignores every wildcard rule — including any global Disallow you thought applied.
  • Blocking `Googlebot` when you meant `Google-Extended`. This removes you from Google Search instead of just opting out of Gemini training.
  • Using the full HTTP user-agent string (the long Mozilla/5.0 ... GPTBot/1.1 line) as the User-agent value. Use the short token only.
  • Relying on robots.txt to hide private data. robots.txt is public and only requests compliance. Sensitive pages need real authentication, not a Disallow.
  • Forgetting subdomains. Your main site allows AI bots, but blog.example.com has an old restrictive file you forgot about.
  • A CDN-level AI block overriding a perfectly correct origin file — and nobody checking the CDN dashboard.
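
Several of the mistakes above can be caught mechanically. The sketch below is a rough lint, not a validator: `lint_robots` is a hypothetical helper, the known-token set comes from the table in this guide plus Bingbot, and an "unrecognized" token is only a hint, since legitimate crawlers outside that set exist. The long user-agent string in the sample is abbreviated for illustration.

```python
from typing import List

# Lowercased tokens from this guide's table, plus the wildcard and Bingbot.
KNOWN_TOKENS = {
    "*", "gptbot", "oai-searchbot", "chatgpt-user", "claudebot",
    "claude-searchbot", "claude-user", "anthropic-ai", "perplexitybot",
    "perplexity-user", "google-extended", "googlebot", "bingbot", "amazonbot",
    "applebot-extended", "ccbot", "bytespider", "meta-externalagent",
}

def lint_robots(robots_txt: str, known_tokens=KNOWN_TOKENS) -> List[str]:
    """Return human-readable warnings for a few common robots.txt mistakes."""
    warnings = []
    agents = []       # user-agent values of the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                agents, in_rules = [], False
            agents.append(value)
            if "/" in value or " " in value:
                warnings.append(f"line {number}: use the short token, "
                                f"not a full user-agent string: {value!r}")
            elif value.lower() not in known_tokens:
                warnings.append(f"line {number}: unrecognized token "
                                f"{value!r} (possible typo)")
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                warnings.append(f"line {number}: wildcard group blocks "
                                "the whole site")
    return warnings

SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPT-Bot
Disallow: /private/

User-agent: Mozilla/5.0 (compatible; GPTBot/1.1)
Disallow: /
"""

for warning in lint_robots(SAMPLE):
    print(warning)
```

For the sample file this reports three issues: the blanket wildcard block on line 2, the misspelled token on line 4, and the full user-agent string on line 7.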

robots.txt vs llms.txt

These two files are complementary, not competing. `robots.txt` governs permission — which crawlers may access which paths. `llms.txt` is an emerging convention that provides a content map for AI engines: a curated, Markdown-formatted index at /llms.txt pointing to your most important, AI-friendly pages so models can find and understand your best content quickly. robots.txt decides whether the door is open; llms.txt is the directory just inside it. A complete GEO setup uses both: allow the right bots in robots.txt, then guide them with llms.txt and clean, structured content.

Quick GEO checklist: (1) confirm AI bots aren't blocked in robots.txt or at your CDN, (2) publish a sitemap and an llms.txt, (3) make sure your key answers are in semantic HTML, not JavaScript-rendered shells, (4) add structured data. Then verify the result with a free GEO score check.

Frequently asked questions

Should I allow or block GPTBot?

Allow GPTBot if you are comfortable with OpenAI using your content to train its models; most businesses that prioritize visibility leave it allowed. Block GPTBot via robots.txt if you specifically want to opt your content out of OpenAI model training. A popular middle path is to block GPTBot (training) while allowing OAI-SearchBot (citation), so you stay out of training but remain citable in ChatGPT search.

How do I block AI crawlers?

Add a separate group for each AI bot in your robots.txt with Disallow: /. For example: User-agent: GPTBot on one line, Disallow: / on the next, repeated for ClaudeBot, PerplexityBot, Google-Extended, CCBot, and the others. No single token targets only AI crawlers, so you must list every bot individually. Do not use a single User-agent: * with Disallow: /, as that would also block Google Search.

How do I block GPTBot in robots.txt specifically?

Add exactly two lines: User-agent: GPTBot followed by Disallow: /. Use the short token GPTBot, not the full HTTP user-agent string. To also block OpenAI's search and live-fetch bots, add separate groups for OAI-SearchBot and ChatGPT-User.

Does blocking Google-Extended hurt SEO?

No. Google-Extended only controls whether your content is used for generative AI products like Gemini. It has zero effect on Googlebot, your Google Search indexing, or your organic rankings. You can safely block Google-Extended and keep full Google Search visibility.

Which AI bots should I allow?

For maximum visibility, allow the search and citation bots: OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot and Claude-User (Anthropic), and PerplexityBot. Allowing the training bots (GPTBot, ClaudeBot, CCBot, Google-Extended) is optional and depends on whether you're comfortable with your content being used for model training.

Will robots.txt stop AI from using my content?

It stops compliant, named crawlers from fetching your pages, and the major AI companies (OpenAI, Anthropic, Google, Perplexity) state that their crawlers respect it. It does not technically prevent a non-compliant scraper from reading public pages, nor does it remove content already absorbed into a trained model. For enforcement against bad actors, use server-side or WAF/firewall blocking.

How do I allow ChatGPT to crawl my site?

Make sure your robots.txt does not disallow OpenAI's bots. To be explicit, allow OAI-SearchBot (ChatGPT search citations) and ChatGPT-User (live fetches when a user asks ChatGPT to read your page), and optionally GPTBot (training). Also confirm your CDN or WAF isn't blocking these user-agents independently of your robots.txt.

What's the difference between GPTBot, OAI-SearchBot, and ChatGPT-User?

GPTBot crawls content for training OpenAI's models. OAI-SearchBot indexes pages so they can be cited in ChatGPT search. ChatGPT-User fetches a specific URL live when a user asks ChatGPT to read or browse it. They're separately controllable, so you can allow citation while blocking training.

Do AI crawlers respect crawl-delay?

Support varies. crawl-delay is a non-standard directive that some bots honor and others ignore. It can slow how aggressively a crawler hits your server, but it does not control access. For access control, use Allow/Disallow; for load problems with a specific bot, check that operator's documentation for whether it supports crawl-delay.
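
To see that crawl-delay is a separate value from the access rules, you can read both from the same file with Python's standard-library parser. In this sketch `ExampleBot` is a made-up crawler name and the 10-second delay is arbitrary; the code only shows what the file declares, not whether any real bot honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one bot gets a delay and a path rule, everyone else nothing.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /search/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay is advisory pacing; it is read independently of Allow/Disallow.
example_delay = parser.crawl_delay("ExampleBot")
default_delay = parser.crawl_delay("GPTBot")  # falls back to the wildcard group

# Access is still decided by the Disallow rule, not by the delay.
example_home = parser.can_fetch("ExampleBot", "/")
example_search = parser.can_fetch("ExampleBot", "/search/results")

print(example_delay, default_delay, example_home, example_search)
# prints: 10 None True False
```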

How do I check whether AI bots can see my site?

Fetch your file with curl -A "GPTBot" https://yourdomain.com/robots.txt and read the rules, confirm it returns HTTP 200 as text/plain, and check your CDN/WAF for any AI-blocking toggle. Then run your URL through a free GEO checker such as checkgeoscore.com to see how an AI engine perceives your page and whether anything is blocking it.

Your robots.txt is the gatekeeper for the entire AI era of search. Decide deliberately: allow the AI bots that get you cited, block the ones whose use you object to, and never let a stray Disallow or a forgotten CDN toggle make you invisible to the assistants your audience is already asking. Once your permissions are right, the rest of GEO — semantic structure, structured data, and a clear content map — is what turns access into citations.