robots.txt for AI Crawlers: The Complete Guide

Your robots.txt is the first and most important lever for AI visibility. Whether GPTBot, ClaudeBot or PerplexityBot may crawl your website is decided here — often without the site operator knowing it. This guide shows, step by step, how to configure your robots.txt correctly.

Basics: What is robots.txt?

The robots.txt is a simple text file located in the root directory of your website — at example.com/robots.txt. It contains instructions for web crawlers about which areas of the website may be crawled and which may not.

The underlying protocol is called the Robots Exclusion Protocol; it has been a de facto web standard since 1994 and was formalized as RFC 9309 in 2022. All reputable crawlers — from Googlebot to GPTBot — respect this file.

Important: The robots.txt is a recommendation, not a technical barrier. Reputable crawlers like GPTBot and Googlebot comply with it. Malicious bots and scrapers generally ignore it. For actual access restrictions, server-side measures such as firewall rules are needed.

Structure of a robots.txt

```
# Comment — ignored by crawlers
User-agent: [Crawler name]
Disallow: [Path to block]
Allow: [Path to explicitly allow]
Sitemap: https://example.com/sitemap.xml
```

All important AI crawlers at a glance

| Crawler | Platform | User-Agent | Purpose |
|---|---|---|---|
| GPTBot | ChatGPT (OpenAI) | GPTBot | Training & knowledge updates |
| ClaudeBot | Claude (Anthropic) | ClaudeBot | Context understanding & answers |
| PerplexityBot | Perplexity AI | PerplexityBot | Fact-based search with sources |
| Google-Extended | Gemini (Google) | Google-Extended | Gemini & AI Overviews |
| Amazonbot | Alexa (Amazon) | Amazonbot | Amazon AI products |
| Applebot-Extended | Siri (Apple) | Applebot-Extended | Apple Intelligence |

Step 1: Check your robots.txt

Does your robots.txt exist?

Open your browser and visit yourdomain.com/robots.txt. You should see a text file. If you get a 404 error, your website has no robots.txt — not ideal but not a disaster, as crawlers are allowed to crawl everything by default without a robots.txt.
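If you prefer to check from a script, a minimal sketch using only Python's standard library (the helper names `robots_txt_url` and `has_robots_txt` are illustrative, not part of any existing tool):

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def robots_txt_url(domain: str) -> str:
    # robots.txt always lives at the root of the host.
    return f"https://{domain}/robots.txt"

def has_robots_txt(domain: str) -> bool:
    # True if the file exists (HTTP 200), False on a 404.
    try:
        with urlopen(robots_txt_url(domain), timeout=10) as response:
            return response.status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Remember that a 404 here is not an error condition for crawlers; it simply means everything may be crawled.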

Step 2: Analyse the current state

Are AI crawlers allowed or blocked?

Search your robots.txt for the following patterns that block AI crawlers:

Critical — AI crawlers completely blocked:

```
# PROBLEM: Blocks ALL bots including all AI crawlers
User-agent: *
Disallow: /
```

Warning — possibly unintentional: If your robots.txt was generated by a CMS or plugin, AI crawlers may be blocked without your knowledge. Check the file even if you think everything is correctly configured.
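You can reproduce this check programmatically with Python's built-in `urllib.robotparser`, which applies the same matching rules reputable crawlers use. A minimal sketch with the problematic blanket-block pattern from above:

```python
from urllib.robotparser import RobotFileParser

# The critical pattern: every crawler blocked from every path.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A blanket Disallow hits AI crawlers and search engines alike.
print(parser.can_fetch("GPTBot", "https://example.com/blog/"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/"))    # False
```

Feeding your live robots.txt content into `parse()` lets you test any crawler/URL combination before you publish a change.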

Step 3: Configure AI crawlers

Choose the right strategy

There are three fundamental approaches — depending on your business model and content:

Strategy A: Allow all AI crawlers completely

Recommended for public websites that want to be cited in AI answers. Maximum AI visibility.

```
# All crawlers allowed — maximum AI visibility
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
```

Strategy B: Allow AI crawlers selectively

Recommended when you want to protect certain areas (e.g. member area, admin) but keep public content crawlable.

```
# Public content crawlable, protected areas blocked
User-agent: *
Disallow: /admin/
Disallow: /members/
Disallow: /api/
Disallow: /wp-admin/

Sitemap: https://yourdomain.com/sitemap.xml
```

Strategy C: Block AI crawlers specifically

Recommended if you do not want your content used for AI training, but Googlebot should still crawl.

```
# AI crawlers blocked, Googlebot allowed
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Googlebot still allowed
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
```
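Before deploying Strategy C, it is worth verifying that it really does what you intend: blocking the AI crawlers while leaving Googlebot untouched. A quick sketch with Python's `urllib.robotparser`, shortened to two user agents for brevity:

```python
from urllib.robotparser import RobotFileParser

strategy_c = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(strategy_c.splitlines())

# GPTBot matches its dedicated block; Googlebot falls through to the
# default "User-agent: *" entry, which allows everything.
print(rp.can_fetch("GPTBot", "https://example.com/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/post"))  # True
```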

Step 4: Add the sitemap

Sitemap link in robots.txt

The robots.txt is the ideal place to show crawlers the way to the sitemap. Add a sitemap line at the end of the file — this is one of the simplest measures to ensure all important pages of your website are crawled.

```
# Sitemap link at the end of robots.txt
Sitemap: https://yourdomain.com/sitemap.xml

# For multiple sitemaps — list them all
Sitemap: https://yourdomain.com/sitemap-pages.xml
Sitemap: https://yourdomain.com/sitemap-blog.xml
```
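To confirm your sitemap lines will be picked up, you can extract them the same way crawlers do: scan for the case-insensitive `Sitemap:` directive. A small illustrative helper (the function name is ours):

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    # Collect every "Sitemap:" directive, one URL per line.
    urls = []
    for line in robots_txt.splitlines():
        # Split only on the FIRST colon, so "https://" stays intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

example = """\
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap-pages.xml
Sitemap: https://yourdomain.com/sitemap-blog.xml
"""
print(sitemap_urls(example))
```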

Step 5: Test and validate

Check robots.txt for errors

After every change to robots.txt you should validate it. Syntax errors can cause crawlers to ignore the entire file.

  • robots.txt Validator — Checks syntax, AI crawler configuration and sitemap link on ai-ready-check.de
  • Google Search Console — Under "Crawling" → "robots.txt tester" you can test whether Googlebot may crawl specific URLs
  • Direct access — Open yourdomain.com/robots.txt in your browser and check the content is displayed correctly

Tip: After changing robots.txt it can take anywhere from a few hours to a few days before crawlers read the new version. Urgent corrections (e.g. an accidental block) can be sped up for Googlebot by requesting a re-crawl in Google Search Console.
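A simple pre-upload check can catch the worst syntax problems locally. This sketch flags unknown directives and leading whitespace; it is a rough heuristic of our own, not a full validator:

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(content: str) -> list[str]:
    # Report lines that are neither blank, comments, nor known directives.
    problems = []
    for number, line in enumerate(content.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are always fine
        if line != line.lstrip():
            problems.append(f"line {number}: leading whitespace")
        directive = stripped.partition(":")[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive {directive!r}")
    return problems

print(lint_robots_txt("User-agent: *\nDisallow: /admin/"))  # []
print(lint_robots_txt("Useragent: *"))                      # flags the typo
```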

Ready-made templates for every use case

Template: Standard website (maximum visibility)

```
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
```

Template: WordPress website

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-login.php
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourdomain.com/sitemap.xml
```

Template: Protect content but stay public

```
# Block AI training, allow search
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Google Search still allowed
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
```

The 5 most common mistakes

Mistake 1: All bots accidentally blocked

"Disallow: /" for User-agent * blocks not only spam bots but also Googlebot, GPTBot and all other reputable crawlers. The website disappears from all search engines and AI answers. This error often happens when developers block a website during development and forget to remove the block.

Mistake 2: Syntax errors from incorrect formatting

Spaces at the beginning of a line, missing line breaks between blocks or Windows line endings (CRLF instead of LF) can cause parts of the robots.txt to be ignored. The file must be saved as plain UTF-8 text without BOM.
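Both problems can be fixed mechanically before uploading. In Python, the `utf-8-sig` codec drops a UTF-8 BOM if one is present, and a simple replace normalizes Windows line endings (the helper name is illustrative):

```python
def normalize_robots_txt(raw: bytes) -> str:
    # "utf-8-sig" decodes UTF-8 and silently strips a leading BOM.
    text = raw.decode("utf-8-sig")
    # Convert Windows CRLF line endings to plain LF.
    return text.replace("\r\n", "\n")

# A file saved by a Windows editor: BOM prefix plus CRLF endings.
raw = b"\xef\xbb\xbfUser-agent: *\r\nDisallow:\r\n"
print(normalize_robots_txt(raw))
```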

Mistake 3: Sitemap link missing or incorrect

A missing or incorrect sitemap link means crawlers only find pages that are directly linked. Deep subpages and new content may never be crawled.

Mistake 4: Relative instead of absolute paths for the sitemap

The sitemap URL must be given as an absolute path — with https:// and the full domain name. A relative path like "Sitemap: /sitemap.xml" is not correctly interpreted by some crawlers.
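Whether a sitemap URL is absolute is easy to check with the standard library: an absolute URL has both a scheme and a host. A small illustrative helper:

```python
from urllib.parse import urlparse

def is_absolute_sitemap(url: str) -> bool:
    # Absolute means: explicit http(s) scheme plus a host name.
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_absolute_sitemap("https://yourdomain.com/sitemap.xml"))  # True
print(is_absolute_sitemap("/sitemap.xml"))                        # False
```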

Mistake 5: Crawl-delay set too high

An excessively high Crawl-delay value (e.g. "Crawl-delay: 3600") can mean crawlers only visit very few pages per day. For most websites no Crawl-delay is needed.

Checklist: robots.txt for AI crawlers

  • robots.txt is reachable at yourdomain.com/robots.txt
  • No "Disallow: /" for User-agent * without a conscious decision
  • GPTBot, ClaudeBot and PerplexityBot configured according to your strategy
  • Sitemap link with full URL added at the end of the file
  • File saved as UTF-8 without BOM
  • Syntax checked with a validator
  • After changes, request re-crawl in Google Search Console

Check your robots.txt automatically

AI-Ready Check analyses your robots.txt and shows in seconds whether GPTBot, ClaudeBot and PerplexityBot have access — and what you can improve.

Test for free now →

Frequently Asked Questions

What happens if I have no robots.txt?

Without a robots.txt all crawlers may crawl the entire website by default. This is fine for most public websites. However it is still recommended to create a robots.txt — at least to provide the sitemap link for crawlers and protect admin areas.

Can I block individual pages for AI crawlers?

Yes — with "Disallow: /path-to-page/" you can block individual URLs or entire directories for AI crawlers. You can also define specific rules for just one crawler (e.g. only GPTBot) while other crawlers can still access those pages.
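You can verify such per-crawler, per-path rules locally with Python's `urllib.robotparser`. Here, a hypothetical `/premium/` directory is blocked for GPTBot only:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /premium/

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/premium/guide"))     # False
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))         # True
print(rp.can_fetch("ClaudeBot", "https://example.com/premium/guide"))  # True
```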

Do all AI crawlers respect robots.txt?

All reputable AI crawlers — GPTBot, ClaudeBot, PerplexityBot and Google-Extended — respect robots.txt. Malicious bots and simple scrapers frequently ignore it. For these, robots.txt provides no protection — only server-side measures help.

How often is robots.txt read by AI crawlers?

Crawlers re-read robots.txt at regular intervals — typically daily or every few days. Changes therefore do not take effect immediately. For urgent corrections you can request an immediate re-crawl of robots.txt via Google Search Console — though this only works for Googlebot.

Does blocking GPTBot affect my Google ranking?

No — GPTBot and Googlebot are completely independent crawlers. Blocking GPTBot has no influence on Google ranking. You can block GPTBot for AI training while Googlebot continues to crawl and index all pages.