Overview: All AI crawlers at a glance
Every major AI platform operates its own web crawlers to collect content for training, knowledge updates or real-time search. These crawlers identify themselves via their user agent string and, according to their operators, respect robots.txt. Beyond that, they differ significantly in crawl frequency, purpose, transparency and what they actually do with your content.
Core principle for all crawlers: they are expected to respect robots.txt, they read only the HTML source without JavaScript rendering (with partial exceptions) and they prefer technically clean, fast websites. The main differences are how often they crawl, why they crawl, whether answers link back to your site and how much each provider documents.
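Because most of these crawlers do not execute JavaScript, a quick way to reason about what they see is to check whether a key phrase is present in the raw HTML itself. The following is a minimal Python sketch under that assumption; the function name `visible_without_js` and the sample markup are hypothetical, and the check is a rough approximation rather than a reproduction of any crawler's parser.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes while skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_without_js(raw_html, phrase):
    """Return True if phrase appears in the text of the unrendered HTML."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks)
    return phrase.lower() in text.lower()


# Hypothetical pages: one server-rendered, one filled in by JavaScript.
STATIC_PAGE = "<html><body><h1>Pricing</h1><p>Plans start at 10 EUR.</p></body></html>"
JS_PAGE = (
    '<html><body><div id="app"></div>'
    '<script>render("Plans start at 10 EUR.")</script></body></html>'
)
```

With these samples, the phrase is found in `STATIC_PAGE` but not in `JS_PAGE`, because in the second case it only exists inside a script that a non-rendering crawler never runs.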
As of 2026, eight major AI crawlers are actively indexing the web. Understanding each one lets you make informed decisions about who gets access to your content – and who doesn't.
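Since each crawler announces itself in the User-Agent request header, server-side code can tell them apart with a simple lookup. The following is a minimal Python sketch, not an official detection method: the token list is copied from this article, the function name `identify_ai_crawler` is hypothetical, and the sample header in the usage note is illustrative rather than a verbatim crawler string.

```python
from typing import Optional

# User agent tokens as named in this article (plus Perplexity-User).
AI_CRAWLER_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Perplexity-User",
    "Google-Extended",
    "xAI-Bot",
    "Amazonbot",
    "Applebot-Extended",
    "Bytespider",
)


def identify_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the first known crawler token found in a User-Agent header.

    The match is a case-insensitive substring test, so it tolerates
    version suffixes such as "GPTBot/1.0".
    """
    lowered = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return None
```

For example, `identify_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)")` returns `"GPTBot"`, while an ordinary browser header returns `None`. A user agent string can be spoofed, so a match is best treated as a hint and combined with IP verification where the provider publishes ranges.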
GPTBot – ChatGPT (OpenAI)
GPTBot is the best-known and most thoroughly documented AI crawler. OpenAI officially introduced it in August 2023 and publishes both a documentation page and the current IP address ranges – making it one of the most verifiable AI crawlers.
GPTBot crawls for two purposes: training future GPT models and updating the knowledge of already deployed models. Crawl frequency is moderate compared to Googlebot – important pages are typically visited on a cycle of weeks to months rather than days.
One key consideration: allowing GPTBot means your content may appear in ChatGPT answers, but ChatGPT rarely links directly to sources. You gain AI visibility but not referral traffic. If your goal is traffic over brand presence in AI answers, this trade-off is worth thinking about.
robots.txt control: User-agent: GPTBot – fully supported and reliably respected by OpenAI.
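Published IP ranges are what make a crawler verifiable: a request claiming to be GPTBot can be checked against the provider's list. The sketch below shows the idea in Python using only the standard library. It assumes you have already downloaded the provider's current ranges as CIDR strings; the function name `ip_in_published_ranges` is hypothetical, and `EXAMPLE_RANGES` contains reserved documentation addresses, not real crawler ranges.

```python
import ipaddress
from typing import Iterable


def ip_in_published_ranges(remote_ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if remote_ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(remote_ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only addresses and networks of the same IP version.
        if address.version == network.version and address in network:
            return True
    return False


# Placeholder ranges from the RFC 5737 documentation blocks.
# Replace them with the ranges the crawler's operator actually publishes.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]
```

A request whose user agent says GPTBot but whose address is outside the published ranges is most likely not from OpenAI and can be handled like any other unverified client.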
ClaudeBot – Claude (Anthropic)
ClaudeBot is Anthropic's crawler for the Claude language model. It operates on the same basic principles as GPTBot: reading HTML source, respecting robots.txt, no JavaScript rendering. Anthropic does not publish IP ranges, making it harder to verify crawl activity independently.
Anthropic places great emphasis on Claude understanding relationships, nuance and long-form reasoning rather than just retrieving isolated facts. Well-structured, content-rich pages with a clear hierarchy are therefore likely to be particularly useful.
Claude increasingly includes source citations in its answers, especially in its web search mode. This makes ClaudeBot more valuable for traffic than GPTBot in certain contexts.
robots.txt control: User-agent: ClaudeBot – fully supported.
PerplexityBot – Perplexity AI
PerplexityBot differs from GPTBot and ClaudeBot in one critical respect: Perplexity is primarily a search engine with AI answers – not a pure chatbot. This means PerplexityBot crawls more actively and more frequently than training crawlers, often on a days-to-weeks cycle.
Perplexity cites sources directly and visibly in every answer, with clickable links back to the original pages. A Perplexity citation is therefore the AI citation most likely to send visitors to your site right now. Any website that wants to appear as a Perplexity source needs technically clean, citable pages and a robots.txt that allows PerplexityBot.
For most websites with public informational content, PerplexityBot is the single most important AI crawler to prioritise. Even if you block training crawlers, keeping PerplexityBot allowed is usually worthwhile.
robots.txt control: User-agent: PerplexityBot – respected. Perplexity also operates a secondary crawler called Perplexity-User for real-time lookups triggered by user queries.
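Because Perplexity uses two user agent names, a rule for PerplexityBot alone leaves Perplexity-User untouched. A minimal robots.txt sketch for blocking both, assuming the two tokens named above:

```text
# Block Perplexity's index crawler and its user-triggered fetcher
User-agent: PerplexityBot
User-agent: Perplexity-User
Disallow: /
```

Listing several `User-agent` lines above one rule set applies the same rules to each of them. To allow both instead, replace `Disallow: /` with `Allow: /`.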
Google-Extended – Gemini & AI Overviews
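Google-Extended was introduced in September 2023 as a control for AI-specific use of your content. According to Google's documentation, it is a robots.txt product token rather than a crawler with its own user agent string: fetching is still done by Google's existing crawlers, and the token governs whether crawled content may be used for Gemini. It is distinct from Googlebot, which crawls for classic Google Search. Google documents AI Overviews, the AI answers appearing at the top of search results, as a Search feature, so check the current documentation before relying on Google-Extended to control them.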
The key advantage of Google-Extended: it can be controlled entirely independently from Googlebot. Blocking Google-Extended has no direct impact on your Google Search ranking – Googlebot continues crawling regardless. This gives website owners a genuine choice about AI training participation without SEO risk.
If you appear in AI Overviews, Google does include source citations. This matters most for high-traffic informational queries where AI Overviews appear prominently.
robots.txt control: User-agent: Google-Extended – fully supported and independently controllable from Googlebot.
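The independence from Googlebot can be expressed as two separate groups in robots.txt. A minimal sketch, assuming you want to opt out of AI use while staying fully crawlable for Search:

```text
# Opt out of Google's AI-specific use of your content
User-agent: Google-Extended
Disallow: /

# Classic Google Search crawling stays allowed
User-agent: Googlebot
Allow: /
```

The Googlebot group is optional, since a crawler without a matching group is allowed by default, but stating it explicitly documents the intent.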
xAI-Bot – Grok (xAI)
Grok is the AI model from xAI, Elon Musk's AI company, deeply integrated into the X (formerly Twitter) platform. The web crawler identifies itself as xAI-Bot and crawls for both training and knowledge updates.
xAI is among the least transparent of the major AI providers: no published IP ranges, limited official documentation and little clear communication about crawling scope or data usage. This makes it harder to verify whether robots.txt rules are reliably respected.
A key differentiator: Grok has real-time access to X posts and can incorporate social media signals directly into answers – independently of web crawling. This means your X presence and web presence work together for Grok visibility in a way that doesn't apply to other AI platforms.
robots.txt control: User-agent: xAI-Bot – respected according to current knowledge, but less verifiable than other crawlers.
Amazonbot – Amazon / Alexa AI
Amazonbot is Amazon's web crawler, operating primarily for Alexa voice assistant answers and Amazon's broader AI initiatives. Amazon publishes its IP address ranges, which allows server-side verification of crawler authenticity – a notable plus for transparency.
While Alexa's share of the smart speaker market has declined, Amazon is actively integrating AI into its shopping, AWS and Alexa ecosystems. Amazonbot's crawl scope reflects this: it focuses on factual, structured content that can be used to answer voice queries and power AI features across Amazon's product line.
For most websites, Amazonbot has lower immediate traffic impact than PerplexityBot or Google-Extended. However, e-commerce sites, local businesses and informational content producers benefit from maintaining Alexa visibility, especially as Amazon expands its AI answer features.
robots.txt control: User-agent: Amazonbot – fully supported. Amazon publishes clear documentation and IP ranges for verification.
Applebot-Extended – Apple Intelligence / Siri
Applebot-Extended is Apple's dedicated AI training crawler, introduced in 2024 alongside the rollout of Apple Intelligence. It is completely separate from the regular Applebot, which crawls for Spotlight search and Safari Suggestions. The two can be controlled independently.
Apple introduced this separation deliberately to give website owners control over AI training participation without affecting their presence in Apple's regular search features. Blocking Applebot-Extended does not affect Spotlight indexing or Siri's ability to surface your website as a regular result.
Apple Intelligence is tightly integrated into iOS, iPadOS and macOS – a user base of over a billion active devices. As Apple Intelligence expands its capabilities, Applebot-Extended's strategic importance will increase significantly. Websites targeting iOS users in particular should consider their Applebot-Extended policy carefully.
Apple publishes IP ranges for verification, making Applebot-Extended one of the more transparent AI crawlers despite being relatively new.
robots.txt control: User-agent: Applebot-Extended – fully supported and independently controllable from regular Applebot.
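As with Google, the split can be written as two groups. A minimal robots.txt sketch, assuming you want to opt out of AI training while remaining available to Apple's regular search features:

```text
# Opt out of Apple's AI training use
User-agent: Applebot-Extended
Disallow: /

# Regular Applebot (Spotlight, Siri, Safari Suggestions) stays allowed
User-agent: Applebot
Allow: /
```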
Bytespider – ByteDance / TikTok AI
Bytespider is ByteDance's web crawler – the company behind TikTok and the large language model family used in their AI products. It has been observed crawling at very high volumes, in some cases more aggressively than other AI crawlers, which has led to concerns among webmasters and hosting providers.
ByteDance does not publish IP ranges or comprehensive documentation for Bytespider, making it the least transparent of all major AI crawlers. There is no clear public-facing AI product that surfaces web content with source citations – Bytespider appears to be primarily a training data crawler.
Given the lack of transparency, the absence of source links and reported aggressive crawl behaviour, many website owners choose to block Bytespider by default unless they have a specific reason to allow it. This is a reasonable precaution that has no known negative impact on any user-facing AI search product.
robots.txt control: User-agent: Bytespider – nominally respected, but compliance is hard to verify given the limited documentation, which is why the comparison table lists it as "Unclear".
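Whether a crawler is actually aggressive on your site is something your own access logs can answer. The following Python sketch counts log lines per crawler token. It is a rough illustration only: the function name `count_crawler_hits` and the sample log lines are hypothetical, and matching against the whole line (rather than parsing out the user agent field) is a deliberate simplification.

```python
from collections import Counter
from typing import Iterable

# A subset of the tokens from this article; extend as needed.
CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Amazonbot", "Bytespider")


def count_crawler_hits(log_lines: Iterable[str]) -> Counter:
    """Count log lines per crawler token (case-insensitive substring match)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each line to at most one crawler
    return hits


# Hypothetical lines in combined log format, using documentation IP addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '192.0.2.10 - - [01/Jan/2026:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '198.51.100.7 - - [01/Jan/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
]
```

Running `count_crawler_hits` over a day of real logs gives a per-crawler request count that can inform the decision to block, rather than relying on reports from other sites.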
Direct comparison: all crawlers in one table
| Crawler | User agent | JS rendering | Crawl freq. | Source links | IP ranges | Transparency | robots.txt | Recommendation |
|---|---|---|---|---|---|---|---|---|
| GPTBot | GPTBot | No | Weeks–months | No | Yes | High | Yes | Allow for AI presence |
| ClaudeBot | ClaudeBot | No | Weeks–months | Partial | No | Medium | Yes | Allow for AI presence |
| PerplexityBot | PerplexityBot | No | Days–weeks | Yes | Yes | Medium | Yes | Highest priority |
| Google-Extended | Google-Extended | Partial | Regular | Yes (AI Overviews) | Yes | High | Yes | Allow for AI Overviews |
| xAI-Bot | xAI-Bot | No | Unknown | Partial | No | Low | Yes | Optional |
| Amazonbot | Amazonbot | No | Weeks | No | Yes | Medium | Yes | Allow if targeting Alexa |
| Applebot-Extended | Applebot-Extended | No | Weeks–months | No | Yes | High | Yes | Allow for iOS audience |
| Bytespider | Bytespider | No | High / aggressive | No | No | Low | Unclear | Block by default |
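💡 Tip: Use the robots.txt Validator to check whether your current configuration correctly controls each of these crawlers. Under the Robots Exclusion Protocol (RFC 9309), user agent names are matched case-insensitively, so GPTBot and gptbot should be treated the same. Writing the documented spelling is still good practice in case a parser is stricter than the standard.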
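A local check along the same lines can be sketched with Python's standard library module `urllib.robotparser`. This is a minimal illustration rather than a replacement for a full validator: the function name `crawler_access` and the `SAMPLE` file are hypothetical, the token list is taken from the table above, and Python's parser may differ in detail from how an individual crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# User agent tokens from the comparison table.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "xAI-Bot", "Amazonbot", "Applebot-Extended", "Bytespider",
]


def crawler_access(robots_txt: str, path: str = "/") -> dict:
    """Map each crawler token to True (allowed) or False (blocked) for path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, path) for name in AI_CRAWLERS}


# Hypothetical robots.txt: block Bytespider, allow everyone else.
SAMPLE = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for name, allowed in crawler_access(SAMPLE).items():
        print(f"{name}: {'allowed' if allowed else 'blocked'}")
```

With `SAMPLE`, only Bytespider is reported as blocked. Passing the text of your own robots.txt shows at a glance which of the eight crawlers can reach a given path.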
Which strategy is right for you?
Maximum AI visibility – allow all except Bytespider
For public websites with informational content, service providers, blogs and tools: allow all crawlers except Bytespider. This is the most sensible default for anyone wanting to be cited in as many AI answers as possible. Bytespider's lack of transparency and aggressive crawl behaviour make it the one exception worth blocking in most cases.
Selective – prioritise search-based crawlers
If your primary goal is clickable traffic from AI sources, prioritise PerplexityBot and Google-Extended. These are the two crawlers where a citation directly translates into a link back to your website. GPTBot, ClaudeBot and Amazonbot build AI presence without driving direct traffic.
Block training, allow search
Anyone who does not want their content used for AI model training – but wants to appear in real-time AI search results – can block GPTBot, ClaudeBot, Applebot-Extended and Amazonbot while keeping PerplexityBot and Google-Extended active. This separates the training use case from the search visibility use case.
Block all AI crawlers
For websites with copyrighted, paid or sensitive content, blocking all AI crawlers is a reasonable choice. This is a conscious decision against AI visibility and carries the trade-off of not appearing in any AI-generated answers. For publishers concerned about content use without compensation, this may be the right call.
Ready-made robots.txt templates
Allow all AI crawlers (maximum visibility)
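A minimal sketch of this template, assuming the user agent tokens listed in the comparison table above plus Perplexity-User:

```text
# Allow all AI crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: xAI-Bot
User-agent: Amazonbot
User-agent: Applebot-Extended
User-agent: Bytespider
Allow: /
```

Crawlers without a matching group are allowed by default, so this file mainly documents the decision and protects it from a stricter catch-all rule elsewhere in the file.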
Recommended default: allow all, block Bytespider
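A minimal sketch of this default, assuming the same user agent tokens as in the comparison table:

```text
# Allow the crawlers that offer transparency or source links
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: xAI-Bot
User-agent: Amazonbot
User-agent: Applebot-Extended
Allow: /

# Block Bytespider
User-agent: Bytespider
Disallow: /
```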
Block training crawlers only (allow search)
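A minimal sketch of this split, following the grouping described in the strategy section (GPTBot, ClaudeBot, Applebot-Extended and Amazonbot blocked; PerplexityBot and Google-Extended allowed). Crawlers not listed here, such as xAI-Bot and Bytespider, are unaffected and need their own group if you want to control them:

```text
# Block crawlers treated as training crawlers in this article
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Applebot-Extended
User-agent: Amazonbot
Disallow: /

# Keep search-oriented access open
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
```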
Block all AI crawlers
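A minimal sketch of a full block, assuming the user agent tokens from the comparison table plus Perplexity-User. Regular search crawlers such as Googlebot and Applebot are not listed and therefore stay allowed:

```text
# Block all AI crawlers named in this article
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: xAI-Bot
User-agent: Amazonbot
User-agent: Applebot-Extended
User-agent: Bytespider
Disallow: /
```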
- Make a conscious decision: which AI crawlers should have access to your content?
- Check robots.txt for correct user agent names. Matching is case-insensitive under RFC 9309, but use the documented spelling (GPTBot rather than gptbot) for consistency
- Add Perplexity-User alongside PerplexityBot if you want to block all Perplexity access
- Blocking Google-Extended does not affect your Google Search ranking
- Blocking Applebot-Extended does not affect Spotlight or regular Siri results
- Test robots.txt with a validator after every change
- Keep TTFB (time to first byte) under 800 ms so that all crawlers can fully read your pages
