Why Your Old Robots.txt Is No Longer Enough in 2026
Robots.txt was designed in 1994 to tell search engine crawlers which pages to crawl and which to leave alone. For three decades, that simple two-instruction file (User-agent and Disallow) was all most websites needed. That era is over.
In 2026, your website is being crawled by a completely different category of bot. GPTBot crawls pages to train OpenAI's models and supply ChatGPT search with source material. ClaudeBot does the same for Anthropic. PerplexityBot powers Perplexity's real-time AI search answers. Google-Extended feeds Google's Gemini models. Together, requests from GPTBot and ClaudeBot alone now equal approximately 20% of Googlebot's monthly request volume, and that number is growing every month.
None of these bots are covered by a standard robots.txt that only mentions Googlebot and Bingbot. If your robots.txt has not been updated since 2023, every one of these AI crawlers is accessing your entire site by default, including pages you may not want used for AI training, scraped for content generation, or summarized without attribution. This generator fixes that in under two minutes.
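The default-allow behavior described above can be checked with Python's standard-library robots.txt parser. This is a minimal sketch: the legacy file contents, the example.com URL, and the helper name allowed are assumptions made for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical legacy robots.txt that names only traditional search crawlers.
LEGACY_ROBOTS = """\
User-agent: Googlebot
Disallow: /admin/

User-agent: Bingbot
Disallow: /admin/
"""


def allowed(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if this robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/admin/reports"  # placeholder URL
results = {
    bot: allowed(LEGACY_ROBOTS, bot, URL)
    for bot in ("Googlebot", "GPTBot", "ClaudeBot")
}
for bot, ok in results.items():
    print(f"{bot}: {'allowed' if ok else 'blocked'}")
```

Googlebot is blocked from the admin path because a group names it. GPTBot and ClaudeBot match no group at all, so the parser treats every URL as allowed for them.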
What Is llms.txt and Why Do You Need One?
llms.txt is a new standard file, proposed in 2024 and rapidly adopted in 2025 and 2026, that tells AI language models how your content may be used. Where robots.txt controls whether a bot can visit a page, llms.txt controls what an AI system is permitted to do with your content after it visits: whether it can summarize it, cite it, train on it, or include it in generated answers.
Think of it this way: robots.txt is a gate. llms.txt is a terms-of-service notice posted at the gate explaining what visitors are allowed to do inside. Both are necessary. robots.txt manages access. llms.txt manages usage rights and AI answer behavior.
An llms.txt file lives at the root of your domain (yourdomain.com/llms.txt) and uses a simple markdown-like format to describe your site, list your important pages with brief descriptions, and specify content usage rules. A typical llms.txt file looks like this:
# MySite — Developer Tools Platform
> A collection of free developer tools for web developers and engineers.
## Important Pages
- [Blog](https://mysite.com/blog): Technical articles and tutorials
- [Tools](https://mysite.com/tools): Free web development utilities
- [About](https://mysite.com/about): Information about the platform
## Usage Policy
AI models may cite and summarize content from this site.
Content may not be used for commercial AI training without permission.
Always attribute content to: mysite.com
Not all AI systems honor llms.txt instructions. Unlike robots.txt, which most crawlers respect strictly, llms.txt is a voluntary standard. But the major platforms, including OpenAI, Anthropic, and Perplexity, have committed to reading and respecting it. Getting your llms.txt in place now positions you correctly as the standard becomes more widely enforced.
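The llms.txt layout in the example above can be assembled mechanically from a few inputs. The sketch below is one possible way to do it, not the generator's actual code: the Page type and the build_llms_txt function are hypothetical names, and the blank lines between sections are an assumption added so the output renders as ordinary markdown.

```python
from typing import NamedTuple


class Page(NamedTuple):
    title: str
    url: str
    description: str


def build_llms_txt(site_name: str, summary: str,
                   pages: list[Page], policy: list[str]) -> str:
    """Assemble an llms.txt body: title, summary, page list, usage policy."""
    parts = [f"# {site_name}", "", f"> {summary}", "", "## Important Pages", ""]
    parts += [f"- [{p.title}]({p.url}): {p.description}" for p in pages]
    parts += ["", "## Usage Policy", ""]
    parts += policy
    return "\n".join(parts) + "\n"


llms_txt = build_llms_txt(
    site_name="MySite",
    summary="A collection of free developer tools for web developers and engineers.",
    pages=[
        Page("Blog", "https://mysite.com/blog", "Technical articles and tutorials"),
        Page("Tools", "https://mysite.com/tools", "Free web development utilities"),
    ],
    policy=[
        "AI models may cite and summarize content from this site.",
        "Content may not be used for commercial AI training without permission.",
    ],
)
print(llms_txt)
```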
The AI Bots You Need to Know in 2026
These are the major AI and LLM crawlers actively visiting websites in 2026. Each one has a specific User-agent string you use in robots.txt to control its access:
GPTBot: OpenAI's crawler. Used to train GPT models and supply ChatGPT search with real-time source material. One of the highest-volume AI crawlers.
User-agent: GPTBot
ClaudeBot: Anthropic's crawler for Claude model training and Anthropic search features.
User-agent: ClaudeBot
PerplexityBot: Crawls pages to supply Perplexity AI search with source content for its real-time answer generation. Allowing this bot means your content can be cited in Perplexity answers.
User-agent: PerplexityBot
Google-Extended: Google's control token for Gemini AI model training. It is separate from the standard Googlebot token that handles traditional search indexing. Blocking Google-Extended has no effect on your Google Search rankings; it only affects Gemini training data.
User-agent: Google-Extended
Applebot-Extended: Apple's crawler for Apple Intelligence features.
User-agent: Applebot-Extended
DuckAssistBot: DuckDuckGo's crawler for AI-assisted answers.
User-agent: DuckAssistBot
Meta-ExternalAgent: Meta's crawler for training AI models including Llama.
User-agent: Meta-ExternalAgent
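One way to see which of these crawlers already visit a site is to match the tokens above against the User-Agent headers in server logs. The sketch below assumes simple case-insensitive substring matching. The function name identify_ai_bot and the sample header strings are illustrative, not taken from any vendor's documentation. Some tokens, Google-Extended for example, may act only as robots.txt controls and never appear in a request header, so an empty count for them does not prove that the content is unused.

```python
from collections import Counter
from typing import Optional

# Tokens from the list above, mapped to the operator named in this article.
AI_BOT_TOKENS = {
    "GPTBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Google-Extended": "Google",
    "Applebot-Extended": "Apple",
    "DuckAssistBot": "DuckDuckGo",
    "Meta-ExternalAgent": "Meta",
}


def identify_ai_bot(user_agent_header: str) -> Optional[str]:
    """Return the first known AI bot token found in the header, if any."""
    header = user_agent_header.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in header:
            return token
    return None


# Illustrative header values; real strings vary by vendor and version.
SAMPLE_HEADERS = [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
]

hits = Counter()
for sample in SAMPLE_HEADERS:
    bot = identify_ai_bot(sample)
    if bot:
        hits[bot] += 1

for bot, count in hits.items():
    print(f"{bot} ({AI_BOT_TOKENS[bot]}): {count} request(s)")
```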
Should You Block or Allow AI Bots? The Real Answer
This is the question every site owner is asking in 2026, and the honest answer is: it depends on your goals, not on a blanket rule. Here is how to think through it clearly.
Allow AI bots if you want visibility in AI search results. When you allow GPTBot, ClaudeBot, or PerplexityBot to crawl your site, you increase the chance that your content appears in AI-generated answers, gets cited in Perplexity search results, and is referenced when users ask AI assistants questions related to your domain. For bloggers, publishers, tool sites, and anyone who builds their business on content visibility, allowing AI crawlers is generally the right choice. It is the foundation of Answer Engine Optimization (AEO), the strategy of getting your content cited in AI answers, not just ranked in blue-link search results. Our detailed guide on AEO vs traditional SEO covers exactly how this visibility works and why it matters for traffic in 2026.
Block AI bots if you want to protect proprietary content. If your site contains original research, licensed content, paid subscription material, or any content where unauthorized AI training would harm your business, blocking the training-focused crawlers makes sense. You can block GPTBot (which trains models) while allowing PerplexityBot (which only uses content for real-time answers, not training). These are independent user-agents and can be set separately.
The one thing you should never do: Leave your robots.txt unchanged from 2023. Whether you choose to allow or block AI bots, the explicit choice is always better than the implicit default of allowing everything.
How to Use This Robots.txt and LLMs.txt Generator
Choose your crawl strategy: Select one of the strategy tabs: Allow All (maximum AI search visibility), Block AI Training (allows answer bots, blocks model training), Block All AI (maximum content protection), or Custom (set each bot individually).
Configure your sitemap URL: Enter the full URL to your sitemap XML file. This gets added to your robots.txt automatically. It is one of the fastest signals you can give any crawler, including AI bots, about your site structure.
Set your restricted paths: Add any directory paths you want to block for all bots, such as admin panels, staging areas, login pages, API endpoints, and any private content.
Fill in your llms.txt details: Enter your site name, a brief description, your key pages with short labels, and your content usage policy. The tool generates the correctly formatted llms.txt file from your inputs.
Download both files: Click Generate to download your robots.txt and llms.txt files. Place both in the root directory of your domain and verify them live at yourdomain.com/robots.txt and yourdomain.com/llms.txt.
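The steps above can be sketched as a small function that turns a strategy, a sitemap URL, and a list of restricted paths into robots.txt text. This is an illustration under stated assumptions, not the generator's actual implementation. The function name build_robots_txt and the strategy keys are hypothetical, and the split into training bots and answer bots is an assumption based on how this article describes each crawler.

```python
# Assumed grouping, based on the bot descriptions in this article.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended",
                 "Applebot-Extended", "Meta-ExternalAgent"]
ANSWER_BOTS = ["PerplexityBot", "DuckAssistBot"]


def build_robots_txt(strategy: str, sitemap_url: str,
                     restricted_paths: list[str]) -> str:
    """Build robots.txt text for one of three preset strategies."""
    if strategy == "allow_all":
        blocked: list[str] = []
    elif strategy == "block_ai_training":
        blocked = TRAINING_BOTS
    elif strategy == "block_all_ai":
        blocked = TRAINING_BOTS + ANSWER_BOTS
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Global group: restricted paths apply to every bot without its own group.
    lines = ["User-agent: *"]
    # An empty Disallow value means "nothing is disallowed".
    lines += [f"Disallow: {path}" for path in restricted_paths] or ["Disallow:"]

    # One group per blocked bot, each disallowing the whole site.
    for bot in blocked:
        lines += ["", f"User-agent: {bot}", "Disallow: /"]

    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    strategy="block_ai_training",
    sitemap_url="https://yourdomain.com/sitemap.xml",
    restricted_paths=["/admin/", "/staging/"],
)
print(robots_txt)
```

Bots that are not blocked get no group of their own in this sketch, so they fall back to the wildcard group and still respect the restricted paths.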
Before deploying, make sure your meta tags and Open Graph data are also in good shape, because AI crawlers read metadata as part of their content evaluation. Our SEO Meta Tag and Open Graph Generator handles that in one step, and pairs well with this tool for a complete technical SEO setup.
Robots.txt Syntax Reference: Every Rule You Need
Writing correct robots.txt syntax is straightforward once you know the core directives. Here is every rule this generator uses, with plain-English explanations:
User-agent: * - Applies the following rules to every bot that reads the file. Use this for global rules that apply to all crawlers.
User-agent: GPTBot - Applies the following rules only to OpenAI's GPTBot crawler. Rules for specific bots override the wildcard rules for that bot.
Disallow: /admin/ - Tells the bot it is not permitted to crawl any URL that starts with /admin/. The trailing slash is important: it blocks the directory and everything inside it.
Disallow: / - Blocks the bot from crawling your entire site. When used under a specific User-agent, it blocks only that bot.
Allow: /blog/ - Explicitly permits access to a path even if a broader Disallow rule would otherwise block it. Use this when you want to block most of a directory but allow specific sections.
Sitemap: https://yourdomain.com/sitemap.xml - Tells every crawler where your sitemap lives. Always include this. It is the single highest-value line you can add to robots.txt for discoverability.
Crawl-delay: 10 - Asks the bot to wait 10 seconds between requests. Note that Googlebot ignores this directive, and the crawl rate limiter in Google Search Console was retired in early 2024, so Crawl-delay only helps with crawlers that choose to honor it.
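The directives above can be exercised with Python's standard-library parser to see how they interact. This is a sketch with a made-up rule set: the /admin/help/ path and the SomeBot name are assumptions. One caveat: Python's parser applies the first matching rule in file order, whereas RFC 9309 and Google's crawler use the most specific (longest) match, so the Allow line is listed before the broader Disallow here to get the same result under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set combining the directives described above.
ROBOTS = """\
User-agent: *
Allow: /admin/help/
Disallow: /admin/
Crawl-delay: 10

User-agent: GPTBot
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

BASE = "https://yourdomain.com"
checks = {
    # The wildcard group blocks /admin/ for a bot with no group of its own.
    "generic_admin": parser.can_fetch("SomeBot", BASE + "/admin/users"),
    # The Allow line carves an exception out of the blocked directory.
    "generic_help": parser.can_fetch("SomeBot", BASE + "/admin/help/faq"),
    # GPTBot has its own group, so only that group applies to it.
    "gptbot_blog": parser.can_fetch("GPTBot", BASE + "/blog/"),
    "gptbot_help": parser.can_fetch("GPTBot", BASE + "/admin/help/faq"),
}
delay = parser.crawl_delay("SomeBot")   # taken from the wildcard group
sitemaps = parser.site_maps()           # available in Python 3.8 and later

print(checks, delay, sitemaps)
```

The last check shows the override rule in action: the wildcard group's Allow line does not help GPTBot, because a bot with its own group ignores the wildcard group entirely.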
Automating Your Sitemap and Crawl Workflow
Once your robots.txt and llms.txt are deployed, the next step for most sites is making sure your sitemap stays current and gets pinged to search engines and AI crawlers whenever you publish new content. If you run scheduled content updates, database cleanups, or automated sitemap regeneration on your server, our Cron Job Expression Generator makes it easy to build the correct schedule expression for any platform, including Vercel, GitHub Actions, and AWS EventBridge, with plain-English explanations of when each job will fire.
Optimizing Your LLMs.txt for AI Search Visibility
A good llms.txt file does more than just list your pages. It gives AI systems enough context to understand what your site is authoritative about, which increases the likelihood that your content gets cited when a user asks a relevant question. A few things that improve your llms.txt quality:
Be specific about your site's expertise: Instead of "a website about technology," write "a free developer tools platform specializing in web development utilities, SEO tools, and code generators." The more precisely you describe your domain expertise, the more confidently AI systems can cite you for relevant queries.
Prioritize your highest-value pages: List the pages that best represent your expertise at the top of the page list. AI systems reading llms.txt may treat the order as a signal of priority.
Keep descriptions under 150 characters per page: Concise, specific descriptions work better than long ones. Our Word and Character Counter helps you check the length of each description as you draft the file.
State your usage policy clearly: Whether you allow citation, summarization, and training or restrict any of these, be explicit. An ambiguous policy is likely to be treated as "allow all" by AI systems.
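The 150-character guideline above is easy to check automatically. The sketch below assumes page entries follow the "- [Title](URL): description" layout used in this article's llms.txt example. The function name long_descriptions and the sample text are hypothetical.

```python
import re

# Matches entries of the form "- [Title](https://...): description".
LINK_LINE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\):\s*(?P<desc>.*)$"
)
MAX_DESCRIPTION_CHARS = 150


def long_descriptions(llms_text: str,
                      limit: int = MAX_DESCRIPTION_CHARS) -> list[tuple[str, int]]:
    """Return (title, length) for each page description that is not under limit."""
    flagged = []
    for line in llms_text.splitlines():
        match = LINK_LINE.match(line.strip())
        if match and len(match["desc"]) >= limit:
            flagged.append((match["title"], len(match["desc"])))
    return flagged


# Sample input: one short description and one deliberately overlong one.
SAMPLE = "\n".join([
    "# MySite",
    "## Important Pages",
    "- [Blog](https://mysite.com/blog): Technical articles and tutorials",
    "- [Tools](https://mysite.com/tools): " + "x" * 160,
])

flagged = long_descriptions(SAMPLE)
for title, length in flagged:
    print(f"{title}: {length} characters (keep it under {MAX_DESCRIPTION_CHARS})")
```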
Once you have both files deployed and your content strategy aligned with AI search visibility, the next level is making sure your prompts and content structure are optimized for how AI tools actually extract and cite information. Our AI Prompt Optimizer helps you structure content and prompts in the format that AI systems find easiest to extract clean, citable answers from.
Common Robots.txt Mistakes That Hurt Your Site in 2026
No Sitemap directive: Forgetting to add Sitemap: to your robots.txt means crawlers have to discover your pages through link crawling alone. Adding the sitemap URL is the highest-value single line in the file.
Blocking CSS and JavaScript: Google's crawler needs to render your pages to understand them. Blocking /static/ or /assets/ directories prevents proper rendering and hurts your search rankings directly.
Using Crawl-delay for Googlebot: Googlebot ignores the Crawl-delay directive entirely. The dedicated crawl rate setting in Google Search Console was retired in early 2024, so if Googlebot is overloading your server, the remaining option described in Google's documentation is to temporarily return 500, 503, or 429 responses.
Accidental wildcard blocks: A misplaced Disallow: / under User-agent: * blocks your entire site from every crawler. This is the single most common high-severity robots.txt mistake, and it can tank your search traffic overnight.
No rules for AI bots: The biggest 2026-specific mistake. Running a robots.txt with no entries for GPTBot, ClaudeBot, PerplexityBot, or Google-Extended means you have made no decision about AI crawler access, and the default is allow everything.
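The mistakes listed above are mechanical enough to catch with a small lint pass. The following is a rough sketch, not a complete robots.txt validator: the function name lint_robots_txt, the warning wording, and the choice of asset paths are assumptions, and the parsing is deliberately simplified (for example, it treats any # as the start of a comment).

```python
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")
ASSET_PATHS = ("/static/", "/assets/")


def lint_robots_txt(text: str) -> list[str]:
    """Return warnings for the common mistakes described above."""
    warnings: list[str] = []
    seen_agents: set[str] = set()
    group: list[str] = []        # user-agents of the group being read
    group_has_rules = False
    has_sitemap = False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if group_has_rules:               # a new group starts here
                group, group_has_rules = [], False
            group.append(value.lower())
            seen_agents.add(value.lower())
        elif field == "sitemap":
            has_sitemap = True
        elif field in ("allow", "disallow", "crawl-delay"):
            group_has_rules = True
            if field == "disallow" and value == "/" and "*" in group:
                warnings.append("Disallow: / under User-agent: * blocks the entire site")
            if field == "disallow" and value in ASSET_PATHS:
                warnings.append(f"Disallow: {value} may block CSS and JavaScript")
            if field == "crawl-delay" and "googlebot" in group:
                warnings.append("Crawl-delay is ignored by Googlebot")

    if not has_sitemap:
        warnings.append("No Sitemap directive")
    missing = [bot for bot in AI_BOTS if bot.lower() not in seen_agents]
    if missing:
        warnings.append("No rules for AI bots: " + ", ".join(missing))
    return warnings


# Hypothetical file that makes every mistake at once.
BAD_ROBOTS = """\
User-agent: *
Disallow: /
Disallow: /assets/

User-agent: Googlebot
Crawl-delay: 10
"""

problems = lint_robots_txt(BAD_ROBOTS)
for problem in problems:
    print("WARNING:", problem)
```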
Why WebToolsHub?
Every tool on WebToolsHub runs entirely in your browser with no server-side processing, no account required, and no data stored or transmitted. The robots.txt and llms.txt files you generate here are built locally in your browser and downloaded directly to your device. Your site configuration, URLs, and content policy details never leave your machine.




