Basics: What is robots.txt?
The robots.txt is a simple text file located in the root directory of your website — at example.com/robots.txt. It contains instructions for web crawlers about which areas of the website may be crawled and which may not.
The underlying protocol is called the Robots Exclusion Protocol. It has been an informal web standard since 1994 and was formally standardized in 2022 as RFC 9309. All reputable crawlers, from Googlebot to GPTBot, respect this file.
Important: The robots.txt is a recommendation, not a technical barrier. Reputable crawlers like GPTBot and Googlebot comply with it. Malicious bots and scrapers generally ignore it. For actual access restrictions, server-side measures such as firewall rules are needed.
Structure of a robots.txt
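A robots.txt consists of one or more rule groups. Each group begins with a User-agent line naming the crawler it applies to, followed by Disallow and Allow rules; a Sitemap line may appear anywhere in the file. A minimal sketch (the domain and paths are placeholders):

```
# Rules for all crawlers
User-agent: *
Disallow: /admin/

# Rules specifically for OpenAI's crawler
User-agent: GPTBot
Allow: /

# Sitemap location (must be an absolute URL)
Sitemap: https://www.example.com/sitemap.xml
```

A crawler looks for the most specific group that matches its user agent; if none matches, the `*` group applies.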
All important AI crawlers at a glance
| Crawler | Platform | User-Agent | Purpose |
|---|---|---|---|
| GPTBot | ChatGPT (OpenAI) | GPTBot | Training & knowledge updates |
| ClaudeBot | Claude (Anthropic) | ClaudeBot | Context understanding & answers |
| PerplexityBot | Perplexity AI | PerplexityBot | Fact-based search with sources |
| Google-Extended | Gemini (Google) | Google-Extended | Gemini & AI Overviews |
| Amazonbot | Alexa (Amazon) | Amazonbot | Amazon AI products |
| Applebot-Extended | Siri (Apple) | Applebot-Extended | Apple Intelligence |
Step 1: Check your robots.txt
Does your robots.txt exist?
Open your browser and visit yourdomain.com/robots.txt. You should see a text file. If you get a 404 error, your website has no robots.txt — not ideal but not a disaster, as crawlers are allowed to crawl everything by default without a robots.txt.
Step 2: Analyse the current state
Are AI crawlers allowed or blocked?
Search your robots.txt for the following patterns that block AI crawlers:
Critical — AI crawlers completely blocked:
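Typical blocking patterns look like this (shown here for GPTBot; the same form applies to ClaudeBot, PerplexityBot and the other AI crawlers):

```
# Blocks every crawler, including Googlebot and all AI bots
User-agent: *
Disallow: /

# Blocks only OpenAI's crawler
User-agent: GPTBot
Disallow: /
```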
Warning — possibly unintentional: If your robots.txt was generated by a CMS or plugin, AI crawlers may be blocked without your knowledge. Check the file even if you think everything is correctly configured.
Step 3: Configure AI crawlers
Choose the right strategy
There are three fundamental approaches — depending on your business model and content:
Strategy A: Allow all AI crawlers completely
Recommended for public websites that want to be cited in AI answers. Maximum AI visibility.
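A sketch of this strategy. The explicit Allow lines are not strictly required, since crawling is allowed by default, but they document the decision clearly:

```
# Major AI crawlers: full access
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# All other crawlers may also crawl everything
User-agent: *
Allow: /
```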
Strategy B: Allow AI crawlers selectively
Recommended when you want to protect certain areas (e.g. member area, admin) but keep public content crawlable.
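For example (the paths /members/ and /admin/ are placeholders for your protected areas; repeat the group for each AI crawler you want to cover):

```
User-agent: GPTBot
Disallow: /members/
Disallow: /admin/
Allow: /

User-agent: ClaudeBot
Disallow: /members/
Disallow: /admin/
Allow: /
```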
Strategy C: Block AI crawlers specifically
Recommended if you do not want your content used for AI training, but Googlebot should still crawl.
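A possible configuration: the AI and training crawlers are blocked by name, while classic search crawlers remain unaffected:

```
# AI crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Classic search crawlers keep full access
User-agent: Googlebot
Allow: /
```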
Step 4: Add the sitemap
Sitemap link in robots.txt
The robots.txt is the ideal place to show crawlers the way to the sitemap. Add a sitemap line at the end of the file — this is one of the simplest measures to ensure all important pages of your website are crawled.
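The line looks like this (replace the placeholder domain with your own; multiple Sitemap lines are allowed):

```
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/news-sitemap.xml
```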
Step 5: Test and validate
Check robots.txt for errors
After every change to robots.txt you should validate it. Syntax errors can cause crawlers to ignore the entire file.
- robots.txt Validator — Checks syntax, AI crawler configuration and sitemap link on ai-ready-check.de
- Google Search Console — the robots.txt report (under Settings → Crawling) shows whether Google could fetch and parse your file and which rules it applied
- Direct access — Open yourdomain.com/robots.txt in your browser and check the content is displayed correctly
Tip: After changing robots.txt it can take anywhere from a few hours to a few days before crawlers read the new version. Urgent corrections (e.g. an accidental block) can be sped up by requesting a recrawl in Google Search Console.
Ready-made templates for every use case
Template: Standard website (maximum visibility)
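A minimal sketch for maximum visibility: everything is crawlable, and the sitemap is linked (domain is a placeholder):

```
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```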
Template: WordPress website
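This follows the convention of WordPress's default virtual robots.txt: the admin area is blocked, but admin-ajax.php stays reachable because front-end features depend on it (domain is a placeholder):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap.xml
```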
Template: Protect content but stay public
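One way to read "protect but stay public": block the AI training crawlers by name while classic search engines and everyone else keep access. A sketch (adjust the crawler list to your own decision; domain is a placeholder):

```
# AI training crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Classic search engines and all others: full access
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```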
The 5 most common mistakes
Mistake 1: All bots accidentally blocked
"Disallow: /" for User-agent * blocks not only spam bots but also Googlebot, GPTBot and all other reputable crawlers. The website disappears from all search engines and AI answers. This error often happens when developers block a website during development and forget to remove the block.
Mistake 2: Syntax errors from incorrect formatting
Spaces at the beginning of a line, missing line breaks between blocks or Windows line endings (CRLF instead of LF) can cause parts of the robots.txt to be ignored. The file must be saved as plain UTF-8 text without BOM.
Mistake 3: Sitemap link missing or incorrect
A missing or incorrect sitemap link means crawlers only find pages that are directly linked. Deep subpages and new content may never be crawled.
Mistake 4: Relative instead of absolute paths for the sitemap
The sitemap URL must be given as an absolute path — with https:// and the full domain name. A relative path like "Sitemap: /sitemap.xml" is not correctly interpreted by some crawlers.
Mistake 5: Crawl-delay set too high
An excessively high Crawl-delay value (e.g. "Crawl-delay: 3600") can mean crawlers visit only a handful of pages per day. Googlebot ignores the directive entirely; for most websites no Crawl-delay is needed at all.
Checklist
- robots.txt is reachable at yourdomain.com/robots.txt
- No "Disallow: /" for User-agent * without a conscious decision
- GPTBot, ClaudeBot and PerplexityBot configured according to your strategy
- Sitemap link with full URL added at the end of the file
- File saved as UTF-8 without BOM
- Syntax checked with a validator
- After changes, request re-crawl in Google Search Console
Check your robots.txt automatically
AI-Ready Check analyses your robots.txt and shows in seconds whether GPTBot, ClaudeBot and PerplexityBot have access — and what you can improve.