Should You Block AI Crawlers Like GPTBot? A 2026 Guide
Blocking GPTBot opts you out of AI training, but blocking OAI-SearchBot or PerplexityBot quietly costs you AI citations. Here's how to tell them apart and decide.
For most sites, the answer is: block the crawlers that only train models, and leave the ones that send you traffic and citations alone. The mistake people make is treating "AI bots" as one group and blanket-blocking all of them, which quietly removes their pages from ChatGPT Search, Perplexity, and other answer engines that would otherwise have cited them.
The catch is that the bots don't announce their purpose. They just show up with a user-agent string. So before you add anything to your robots.txt, you need to know which crawler does what.
What are the two kinds of AI crawlers?
There are two jobs an AI company's crawler can be doing, and they have opposite consequences for you.
Training crawlers collect text to train or improve a model. The content goes into a dataset. You get nothing back: no link, no traffic, no citation. Blocking these costs you almost nothing in visibility.
Answer-engine crawlers fetch pages in real time to answer a user's live question, the same way a search engine does. When one of these reads your page and the assistant cites it, you can earn a click and a brand mention. Block these and you disappear from that surface entirely.
The trick is that the same company often runs both. OpenAI is the clearest example: GPTBot trains models, while OAI-SearchBot powers ChatGPT Search results. Treat them identically and you will either leak training data you wanted to keep, or block the citations you wanted to win.
Which bots are which?
Here are the user-agents worth knowing in 2026, grouped by what they actually do.
| User-agent | Operator | Purpose | Block it? |
|---|---|---|---|
| GPTBot | OpenAI | Model training | Optional |
| CCBot | Common Crawl | Open dataset used to train many models | Optional |
| Bytespider | ByteDance | Training (aggressive crawler) | Often yes |
| Google-Extended | Google | Gemini training opt-out token | Optional |
| OAI-SearchBot | OpenAI | ChatGPT Search answers | Usually no |
| PerplexityBot | Perplexity | Perplexity answers and citations | Usually no |
| ClaudeBot | Anthropic | Training and retrieval | Depends |
A few notes. CCBot belongs to Common Crawl, a nonprofit whose public dataset feeds a huge number of downstream models, so blocking it is the broadest single way to opt out of training. Google-Extended is not a real crawler that visits you; it is a token Google reads in robots.txt to decide whether your already-crawled content can train Gemini, which means blocking it does not affect your normal Google Search ranking. Bytespider has a reputation for crawling hard and ignoring niceties, so many sites block it purely to save server load.
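If you want to see which of these bots already visit you, a small lookup over the user-agent strings in your access logs is enough to get started. The sketch below is an illustration, not a complete detector: the purpose labels are copied from the table above, the function name and the sample user-agent strings are made up, and substring matching is an assumption that a spoofed string will defeat. Google-Extended is left out because it is a robots.txt token and never shows up in logs.

```python
# Purpose labels copied from the table above. Substring matching on the
# user-agent string is an assumption made for this sketch.
CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "CCBot": "training",
    "Bytespider": "training",
    "OAI-SearchBot": "answer-engine",
    "PerplexityBot": "answer-engine",
    "ClaudeBot": "training and retrieval",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the purpose of a known AI crawler, or 'unknown'."""
    lowered = user_agent.lower()
    for token, purpose in CRAWLER_PURPOSE.items():
        if token.lower() in lowered:
            return purpose
    return "unknown"


# Made-up user-agent strings, for illustration only.
samples = [
    "ExampleAgent/1.0 (compatible; GPTBot/1.0)",
    "ExampleAgent/1.0 (compatible; OAI-SearchBot/1.0)",
    "ExampleBrowser/120.0",
]
for ua in samples:
    print(classify_user_agent(ua), "<-", ua)
```

Running this over a day of logs gives a rough picture of who is crawling you before you decide what to block.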
How do you actually block one?
Training and answer-engine crawlers that behave themselves obey robots.txt. You target a bot by its user-agent name:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

That example opts out of OpenAI's training and the Common Crawl dataset while keeping the door open for ChatGPT Search. The Allow: / line is optional since allowing is the default, but stating it makes your intent obvious to anyone reading the file later.
One detail that trips people up: robots.txt rules are grouped by user-agent, and a crawler follows only the most specific group that matches its name. A broad User-agent: * group with Disallow: / therefore does not apply to a named bot that has its own group, and it does apply to any bot you meant to allow but never named. If you are editing these by hand, confirm the result with a robots.txt tester so you can see exactly which bot is allowed or blocked on a given path before you ship it. If you are starting from scratch, a robots.txt generator will lay out the per-bot groups correctly.
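You can also check group matching yourself with Python's standard library. The sketch below feeds a sample file to urllib.robotparser and asks whether a given bot may fetch a given URL. The robots.txt content, the example.com URLs, and the allowed helper are all made up for illustration, and Python's rule matching inside a group is simpler than what the major crawlers implement, so treat it as a sanity check and not as proof of how any particular bot behaves.

```python
from urllib.robotparser import RobotFileParser

# Sample file for illustration: a wildcard group plus two named groups.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


checks = [
    ("GPTBot", "https://example.com/blog/post"),
    ("OAI-SearchBot", "https://example.com/private/report"),
    ("PerplexityBot", "https://example.com/private/report"),
    ("PerplexityBot", "https://example.com/blog/post"),
]
for bot, url in checks:
    print(f"{bot:15} {url:40} allowed={allowed(ROBOTS_TXT, bot, url)}")
```

The second check is the instructive one. OAI-SearchBot is allowed into /private/ because its own group replaces the wildcard group instead of adding to it, while PerplexityBot, which has no group of its own, falls back to the wildcard rules and is kept out.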
What robots.txt cannot do
Here is the important limit. Robots.txt is a request, not a wall. A well-behaved crawler reads it and complies. A crawler that wants your content can ignore it completely, and there is no technical penalty for doing so. Several AI scrapers have been caught crawling sites that disallowed them.
If you genuinely need to stop a bot rather than politely ask it to leave, you enforce that at the edge, in your CDN or web application firewall (WAF). An edge block inspects the incoming request and refuses it before it reaches your application, usually by matching the user-agent string or the source IP range. Cloudflare, Fastly, and similar providers now ship one-click toggles to block known AI crawlers this way.
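A real edge block is configured in your CDN or WAF dashboard, not in application code, but the matching logic is simple enough to sketch. The example below is a minimal WSGI middleware that answers 403 when the user-agent contains a blocked token. Everything in it is an assumption for illustration: the block list, the function names, and the sample user-agent strings. Because it runs inside the application, it also does not save you the cost of the request arriving, which is the main advantage of doing this at the edge.

```python
from wsgiref.util import setup_testing_defaults

# Assumption: the lowercase tokens you have decided to refuse.
BLOCKED_TOKENS = ("bytespider",)


def block_ai_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 to blocked user-agents."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent: str) -> str:
    """Send one fake request through the middleware and return the status."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []

    def start_response(status, headers):
        seen.append(status)

    list(block_ai_crawlers(hello_app)(environ, start_response))
    return seen[0]


# Made-up user-agent strings, for illustration only.
print(status_for("ExampleAgent/1.0 (compatible; Bytespider)"))
print(status_for("ExampleBrowser/120.0"))
```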
The two methods solve different problems:
| | robots.txt | WAF / CDN edge block |
|---|---|---|
| Mechanism | Polite request | Enforced rejection |
| Stops bots that ignore the rules? | No | Yes |
| Reduces server load? | Only when the bot complies; an ignored rule still serves the page | Yes, request refused at the edge |
| Risk | Low | Can block legitimate bots by mistake |
The risk with edge blocks is real. User-agent strings can be spoofed, and an overly broad IP rule can accidentally block Googlebot, an uptime monitor, or a paying customer behind a shared address. Use robots.txt for intent, and reserve WAF rules for bots that have proven they will not respect it.
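One way to reduce the spoofing risk is to trust a user-agent only when the request also comes from an address range the operator publishes. Here is a minimal sketch using Python's ipaddress module. The ranges below are placeholders taken from the address blocks reserved for documentation, not real crawler ranges, and the function name is hypothetical; you would replace the table with whatever each operator actually publishes.

```python
import ipaddress

# Placeholder ranges from the documentation-only address blocks.
# Assumption: replace these with the ranges each operator publishes.
PUBLISHED_RANGES = {
    "oai-searchbot": ["192.0.2.0/24"],
    "perplexitybot": ["198.51.100.0/24", "2001:db8::/32"],
}


def is_verified_crawler(user_agent: str, remote_ip: str) -> bool:
    """True only if the user-agent names a known bot AND the source IP
    falls inside one of that bot's listed ranges."""
    ip = ipaddress.ip_address(remote_ip)
    lowered = user_agent.lower()
    for token, cidrs in PUBLISHED_RANGES.items():
        if token in lowered:
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False


# Same claimed identity, two different source addresses.
print(is_verified_crawler("ExampleAgent/1.0 (OAI-SearchBot/1.0)", "192.0.2.10"))
print(is_verified_crawler("ExampleAgent/1.0 (OAI-SearchBot/1.0)", "203.0.113.5"))
```

A request that claims to be an answer-engine bot but fails this check is a better candidate for a WAF rule than the bot name alone, since the name is the part anyone can fake.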
So what should most sites do?
For a typical business or content site, a sensible default is to allow answer-engine crawlers and decide on training crawlers based on how you feel about your work being used to train models for free.
- You want AI citations and traffic: allow OAI-SearchBot, PerplexityBot, and other retrieval bots. This is the whole point of getting cited by ChatGPT, Perplexity and AI Overviews.
- You are protective of original content: block GPTBot, CCBot, and Google-Extended to opt out of training while keeping answer engines on.
- You are fighting load or scraping abuse: block the aggressive crawlers like Bytespider at the WAF, not just in robots.txt.
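Whichever option you pick, the resulting robots.txt is just one group per bot. As a rough sketch, the "protective of original content" choice above could be generated like this. The function name and its parameters are made up for illustration; the bot names come from the list.

```python
def build_robots_txt(block: list[str], allow: list[str]) -> str:
    """Render one robots.txt group per bot: Disallow for blocked bots,
    an explicit Allow for the ones you want to keep."""
    groups = []
    for bot in block:
        groups.append(f"User-agent: {bot}\nDisallow: /")
    for bot in allow:
        groups.append(f"User-agent: {bot}\nAllow: /")
    # Blank line between groups, newline at end of file.
    return "\n\n".join(groups) + "\n"


# Opt out of training, keep answer engines on.
print(build_robots_txt(
    block=["GPTBot", "CCBot", "Google-Extended"],
    allow=["OAI-SearchBot", "PerplexityBot"],
))
```

Keeping the policy as two short lists makes it easy to review later, which is exactly what a copy-pasted file tends to lack.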
Whatever you choose, the failure mode to avoid is a stale, copy-pasted robots.txt that blocks every "bot" you have ever heard of. That is how good sites accidentally vanish from AI search. If you are not sure what your current file is actually allowing, the AI bot access checker reads your robots.txt and tells you, bot by bot, which AI crawlers can reach your pages today.
Curious where else your setup might be helping or hurting your visibility? A free SEO audit checks your robots.txt, crawlability, and AI-readiness together in about a minute, no signup required.
Kaustav Basak is the creator of SEO AI Audits, a free AI-powered SEO toolkit. He writes about technical SEO, Core Web Vitals, and how search is changing in the age of AI assistants.