robots.txt for AI Crawlers: The Complete Guide

Your robots.txt is the first and most important lever for AI visibility. Whether GPTBot, ClaudeBot or PerplexityBot may crawl your website is decided here — often without the site operator knowing it. This guide shows, step by step, how to configure your robots.txt correctly.

Basics: What is robots.txt?

The robots.txt is a simple text file located in the root directory of your website — at example.com/robots.txt. It contains instructions for web crawlers about which areas of the website may be crawled and which may not.

The underlying protocol is called the Robots Exclusion Protocol; it has been a de facto web standard since 1994 and was formalized as RFC 9309 in 2022. All reputable crawlers — from Googlebot to GPTBot — respect this file.

Important: The robots.txt is a recommendation, not a technical barrier. Reputable crawlers like GPTBot and Googlebot comply with it. Malicious bots and scrapers generally ignore it. For actual access restrictions, server-side measures such as firewall rules are needed.

Structure of a robots.txt

```
# Comment — ignored by crawlers
User-agent: [Crawler name]
Disallow: [Path to block]
Allow: [Path to explicitly allow]
Sitemap: https://example.com/sitemap.xml
```

All important AI crawlers at a glance

| Crawler | Platform | User-Agent | Purpose |
|---|---|---|---|
| GPTBot | ChatGPT (OpenAI) | GPTBot | Training & knowledge updates |
| ClaudeBot | Claude (Anthropic) | ClaudeBot | Context understanding & answers |
| PerplexityBot | Perplexity AI | PerplexityBot | Fact-based search with sources |
| Google-Extended | Gemini (Google) | Google-Extended | Gemini & AI Overviews |
| Amazonbot | Alexa (Amazon) | Amazonbot | Amazon AI products |
| Applebot-Extended | Siri (Apple) | Applebot-Extended | Apple Intelligence |

Step 1: Check your robots.txt

Does your robots.txt exist?

Open your browser and visit yourdomain.com/robots.txt. You should see a text file. If you get a 404 error, your website has no robots.txt — not ideal but not a disaster, as crawlers are allowed to crawl everything by default without a robots.txt.
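If you prefer to check from a script, a minimal sketch using only Python's standard library (the helper names `robots_txt_url` and `has_robots_txt` are illustrative, not part of any existing tool):

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def robots_txt_url(domain: str) -> str:
    # robots.txt always lives at the root of the host.
    return f"https://{domain}/robots.txt"

def has_robots_txt(domain: str) -> bool:
    # True if the file exists (HTTP 200), False on a 404.
    try:
        with urlopen(robots_txt_url(domain), timeout=10) as response:
            return response.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Remember that a 404 here is not an error condition for crawlers; it simply means everything may be crawled.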

Step 2: Analyse the current state

Are AI crawlers allowed or blocked?

Search your robots.txt for the following patterns that block AI crawlers:

Critical — AI crawlers completely blocked:

```
# PROBLEM: Blocks ALL bots including all AI crawlers
User-agent: *
Disallow: /
```

Warning — possibly unintentional: If your robots.txt was generated by a CMS or plugin, AI crawlers may be blocked without your knowledge. Check the file even if you think everything is correctly configured.
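You can reproduce this check programmatically with Python's built-in `urllib.robotparser`, which applies the same matching rules reputable crawlers use. A minimal sketch with the problematic blanket-block pattern from above:

```python
from urllib.robotparser import RobotFileParser

# The critical pattern: every crawler blocked from every path.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A blanket Disallow hits AI crawlers and search engines alike.
print(parser.can_fetch("GPTBot", "https://example.com/blog/"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/"))    # False
```

Feeding your live robots.txt content into `parse()` lets you test any crawler/URL combination before you publish a change.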

Step 3: Configure AI crawlers

Choose the right strategy

There are three fundamental approaches — depending on your business model and content:

Strategy A: Allow all AI crawlers completely

Recommended for public websites that want to be cited in AI answers. Maximum AI visibility.

```
# All crawlers allowed — maximum AI visibility
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
```

Strategy B: Allow AI crawlers selectively

Recommended when you want to protect certain areas (e.g. member area, admin) but keep public content crawlable.

```
# Public content crawlable, protected areas blocked
User-agent: *
Disallow: /admin/
Disallow: /members/
Disallow: /api/
Disallow: /wp-admin/

Sitemap: https://yourdomain.com/sitemap.xml
```

Strategy C: Block AI crawlers specifically

Recommended if you do not want your content used for AI training, but Googlebot should still crawl.

```
# AI crawlers blocked, Googlebot allowed
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Googlebot still allowed
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
```
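Before deploying Strategy C, it is worth verifying that it really does what you intend: blocking the AI crawlers while leaving Googlebot untouched. A quick sketch with Python's `urllib.robotparser`, shortened to two user agents for brevity:

```python
from urllib.robotparser import RobotFileParser

strategy_c = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(strategy_c.splitlines())

# GPTBot matches its dedicated block; Googlebot falls through to the
# default "User-agent: *" entry, which allows everything.
print(rp.can_fetch("GPTBot", "https://example.com/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/post"))  # True
```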

Step 4: Add the sitemap

Sitemap link in robots.txt

The robots.txt is the ideal place to show crawlers the way to the sitemap. Add a sitemap line at the end of the file — this is one of the simplest measures to ensure all important pages of your website are crawled.

```
# Sitemap link at the end of robots.txt
Sitemap: https://yourdomain.com/sitemap.xml

# For multiple sitemaps — list them all
Sitemap: https://yourdomain.com/sitemap-pages.xml
Sitemap: https://yourdomain.com/sitemap-blog.xml
```
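To confirm your sitemap lines will be picked up, you can extract them the same way crawlers do: scan for the case-insensitive `Sitemap:` directive. A small illustrative helper (the function name is ours):

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    # Collect every "Sitemap:" directive, one URL per line.
    urls = []
    for line in robots_txt.splitlines():
        # Split only on the FIRST colon, so "https://" stays intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

example = """\
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap-pages.xml
Sitemap: https://yourdomain.com/sitemap-blog.xml
"""
print(sitemap_urls(example))
```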

Step 5: Test and validate

Check robots.txt for errors

After every change to robots.txt you should validate it. Syntax errors can cause crawlers to ignore the entire file.

  • robots.txt Validator — Checks syntax, AI crawler configuration and sitemap link on ai-ready-check.de
  • Google Search Console — Under "Crawling" → "robots.txt tester" you can test whether Googlebot may crawl specific URLs
  • Direct access — Open yourdomain.com/robots.txt in your browser and check the content is displayed correctly

Tip: After changing robots.txt it can take anywhere from a few hours to a few days before crawlers read the new version. Urgent corrections (e.g. an accidental block) can be sped up for Googlebot by requesting a re-crawl in Google Search Console.
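A simple pre-upload check can catch the worst syntax problems locally. This sketch flags unknown directives and leading whitespace; it is a rough heuristic of our own, not a full validator:

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(content: str) -> list[str]:
    # Report lines that are neither blank, comments, nor known directives.
    problems = []
    for number, line in enumerate(content.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are always fine
        if line != line.lstrip():
            problems.append(f"line {number}: leading whitespace")
        directive = stripped.partition(":")[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive {directive!r}")
    return problems

print(lint_robots_txt("User-agent: *\nDisallow: /admin/"))  # []
print(lint_robots_txt("Useragent: *"))                      # flags the typo
```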

Ready-made templates for every use case

Template: Standard website (maximum visibility)

```
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
```

Template: WordPress website

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-login.php
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourdomain.com/sitemap.xml
```

Template: Protect content but stay public

```
# Block AI training, allow search
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Google Search still allowed
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
```

The 5 most common mistakes

Mistake 1: All bots accidentally blocked

"Disallow: /" for User-agent * blocks not only spam bots but also Googlebot, GPTBot and all other reputable crawlers. The website disappears from all search engines and AI answers. This error often happens when developers block a website during development and forget to remove the block.

Mistake 2: Syntax errors from incorrect formatting

Spaces at the beginning of a line, missing line breaks between blocks or Windows line endings (CRLF instead of LF) can cause parts of the robots.txt to be ignored. The file must be saved as plain UTF-8 text without BOM.
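Both problems can be fixed mechanically before uploading. In Python, the `utf-8-sig` codec drops a UTF-8 BOM if one is present, and a simple replace normalizes Windows line endings (the helper name is illustrative):

```python
def normalize_robots_txt(raw: bytes) -> str:
    # "utf-8-sig" decodes UTF-8 and silently strips a leading BOM.
    text = raw.decode("utf-8-sig")
    # Convert Windows CRLF line endings to plain LF.
    return text.replace("\r\n", "\n")

# A file saved by a Windows editor: BOM prefix plus CRLF endings.
raw = b"\xef\xbb\xbfUser-agent: *\r\nDisallow:\r\n"
print(normalize_robots_txt(raw))
```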

Mistake 3: Sitemap link missing or incorrect

A missing or incorrect sitemap link means crawlers only find pages that are directly linked. Deep subpages and new content may never be crawled.

Mistake 4: Relative instead of absolute paths for the sitemap

The sitemap URL must be given as an absolute path — with https:// and the full domain name. A relative path like "Sitemap: /sitemap.xml" is not correctly interpreted by some crawlers.
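Whether a sitemap URL is absolute is easy to check with the standard library: an absolute URL has both a scheme and a host. A small illustrative helper:

```python
from urllib.parse import urlparse

def is_absolute_sitemap(url: str) -> bool:
    # Absolute means: explicit http(s) scheme plus a host name.
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_absolute_sitemap("https://yourdomain.com/sitemap.xml"))  # True
print(is_absolute_sitemap("/sitemap.xml"))                        # False
```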

Mistake 5: Crawl-delay set too high

An excessively high Crawl-delay value (e.g. "Crawl-delay: 3600") can mean crawlers only visit very few pages per day. For most websites no Crawl-delay is needed.

Checklist: robots.txt for AI crawlers

  • robots.txt is reachable at yourdomain.com/robots.txt
  • No "Disallow: /" for User-agent * without a conscious decision
  • GPTBot, ClaudeBot and PerplexityBot configured according to your strategy
  • Sitemap link with full URL added at the end of the file
  • File saved as UTF-8 without BOM
  • Syntax checked with a validator
  • After changes, request re-crawl in Google Search Console

Check your robots.txt automatically

AI-Ready Check analyses your robots.txt and shows in seconds whether GPTBot, ClaudeBot and PerplexityBot have access — and what you can improve.

Test for free now →

Frequently Asked Questions

What happens if I have no robots.txt?

Without a robots.txt all crawlers may crawl the entire website by default. This is fine for most public websites. However it is still recommended to create a robots.txt — at least to provide the sitemap link for crawlers and protect admin areas.

Can I block individual pages for AI crawlers?

Yes — with "Disallow: /path-to-page/" you can block individual URLs or entire directories for AI crawlers. You can also define specific rules for just one crawler (e.g. only GPTBot) while other crawlers can still access those pages.
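You can verify such per-crawler, per-path rules locally with Python's `urllib.robotparser`. Here, a hypothetical `/premium/` directory is blocked for GPTBot only:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /premium/

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/premium/guide"))     # False
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))         # True
print(rp.can_fetch("ClaudeBot", "https://example.com/premium/guide"))  # True
```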

Do all AI crawlers respect robots.txt?

All reputable AI crawlers — GPTBot, ClaudeBot, PerplexityBot and Google-Extended — respect robots.txt. Malicious bots and simple scrapers frequently ignore it. For these, robots.txt provides no protection — only server-side measures help.

How often is robots.txt read by AI crawlers?

Crawlers re-read robots.txt at regular intervals — typically daily or every few days. Changes therefore do not take effect immediately. For urgent corrections you can request an immediate re-crawl of robots.txt via Google Search Console — though this only works for Googlebot.

Does blocking GPTBot affect my Google ranking?

No — GPTBot and Googlebot are completely independent crawlers. Blocking GPTBot has no influence on Google ranking. You can block GPTBot for AI training while Googlebot continues to crawl and index all pages.