The AI crawler ecosystem has gotten complicated fast. What started as a handful of identifiable bots has expanded into a sprawling taxonomy of training crawlers, search indexing bots, and real-time user-fetch bots from more than a dozen AI companies. If you’re a WordPress publisher who wants to manage how your content gets accessed by these systems, accurate identification is where everything begins.
The three-tier bot architecture
The major AI companies have largely converged on a three-bot architecture, though each has implemented it differently. Understanding this structure is essential before setting up any access rules.
OpenAI operates three bots: GPTBot for training data collection, OAI-SearchBot for ChatGPT Search indexing, and ChatGPT-User for on-demand fetching when a user asks ChatGPT to visit a URL directly. Each can be controlled independently via robots.txt.
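In practice, that independence looks like three separate user-agent groups in robots.txt. As one illustrative configuration (not a recommendation), a site opting out of training while staying visible in ChatGPT Search might use:

```
# Opt out of training data collection
User-agent: GPTBot
Disallow: /

# Allow ChatGPT Search indexing
User-agent: OAI-SearchBot
Allow: /

# Allow on-demand fetches triggered by users
User-agent: ChatGPT-User
Allow: /
```

Because each token is matched separately, changing the rule for one bot leaves the other two unaffected.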
Anthropic operates ClaudeBot for training, Claude-SearchBot for search indexing (launched March 2025), and Claude-User for on-demand fetching. The same separation applies.
Google uses a single Googlebot for core indexing, but introduced Google-Extended as a robots.txt control token for opting out of AI training and Gemini product improvements. Here’s a nuance worth knowing: blocking Google-Extended does not affect AI Overviews or AI Mode. It only affects training. Publishers who believe they’ve opted out of AI Overviews by blocking Google-Extended have not done so.
Perplexity operates PerplexityBot for indexing and Perplexity-User for fetching. Independent security research has documented Perplexity-User using generic browser user-agents to avoid identification, which is why IP-based verification matters alongside user-agent matching. RocketCite Pro covers this in more detail.
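A minimal sketch of what IP-based verification means in practice: check the connecting address against the operator's published IP ranges rather than trusting the user-agent string alone. The ranges below are placeholders (RFC 5737 documentation addresses), not any operator's actual published list.

```php
<?php
// Illustrative CIDR check: verify a claimed bot IP against published
// ranges. The range below is a placeholder, not a real operator range.
function ip_in_cidr( $ip, $cidr ) {
    list( $subnet, $bits ) = explode( '/', $cidr );
    $mask = -1 << ( 32 - (int) $bits );
    return ( ip2long( $ip ) & $mask ) === ( ip2long( $subnet ) & $mask );
}

$published_ranges = array( '192.0.2.0/24' ); // placeholder range
$verified = false;
foreach ( $published_ranges as $range ) {
    if ( ip_in_cidr( $_SERVER['REMOTE_ADDR'], $range ) ) {
        $verified = true; // UA claim is backed by a known operator IP
        break;
    }
}
```

A user-agent match plus a failed IP check is exactly the evasion pattern the Perplexity research documented.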
Why user-fetch bots are a different problem
The -User bots above don’t behave like crawlers. They’re triggered by individual user actions in real time, which means they can visit URLs at any time, in any sequence, and in significant volume when content surfaces in AI search contexts. Some operators, Perplexity among them, have stated that these fetches don’t generally respect robots.txt because the bot is acting on behalf of a human user making a direct request.
This isn’t a simple allow/block situation. It calls for nuanced access control at the content level, which is exactly what RocketCite’s per-post access rules address. Those are covered in a separate feature overview.
The complete bot registry in RocketCite
RocketCite’s free plugin ships with user-agent signatures for 30+ AI bots.
From the major US AI companies: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, Google-Extended, Google-CloudVertexBot, Gemini-Deep-Research, Meta-ExternalAgent, Meta-ExternalFetcher, and Applebot-Extended.
From AI search and other services: PerplexityBot, CCBot, Bytespider, Amazonbot, DuckAssistBot, MistralAI-User, Cohere-ai, AI2Bot, Diffbot, Webz.io, YouBot, and Timpibot.
The registry is stored as a JSON configuration file and is fully hookable. Developers can add custom bot signatures via a rocketcite_bot_registry filter, and detection results are exposed via a rocketcite_detected_bot action hook.
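As a sketch of what that extensibility could look like in a theme or companion plugin — note that the array keys below are illustrative assumptions, not RocketCite’s documented schema; only the two hook names come from the plugin:

```php
<?php
// Illustrative use of the rocketcite_bot_registry filter to register
// a custom signature. The entry's keys ('operator', 'category',
// 'user_agent') are assumptions for this sketch.
add_filter( 'rocketcite_bot_registry', function ( $registry ) {
    $registry['ExampleBot'] = array(
        'operator'   => 'Example AI Inc.', // hypothetical operator
        'category'   => 'training',        // e.g. training | search | user-fetch
        'user_agent' => 'ExampleBot',      // substring matched against the UA header
    );
    return $registry;
} );

// Illustrative use of the rocketcite_detected_bot action to log hits.
add_action( 'rocketcite_detected_bot', function ( $bot ) {
    error_log( 'AI bot detected: ' . $bot );
} );
```

The filter runs before detection, so a custom signature participates in the same matching pass as the built-in registry.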
Why server-side detection matters
JavaScript analytics and CDN-level solutions both have real blind spots here. JavaScript analytics miss any request that doesn’t execute JavaScript, a category that includes virtually all AI bot traffic. CDN-cached responses may be served to bots without the origin server ever knowing the visit happened.
RocketCite’s detection runs in PHP on WordPress’s init and send_headers hooks, which fire before any page rendering occurs. Every request that reaches WordPress gets evaluated, and the response can be modified, logged, or redirected based on the detected bot identity before any content database queries run.
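The general technique can be sketched in a few lines of WordPress code. This is an illustration of server-side user-agent matching on the init hook, not RocketCite’s actual implementation; in the plugin, the signature list comes from the JSON registry rather than a hard-coded array.

```php
<?php
// Illustrative server-side bot check on the init hook (not RocketCite's
// actual code). Fires before template rendering, so the response can
// still be logged, modified, or short-circuited here.
add_action( 'init', function () {
    $ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $signatures = array( 'GPTBot', 'ClaudeBot', 'PerplexityBot' ); // sample subset
    foreach ( $signatures as $sig ) {
        if ( stripos( $ua, $sig ) !== false ) {
            status_header( 403 ); // example action: deny before rendering
            exit;
        }
    }
} );
```

Because this runs on the origin for every request that reaches WordPress, it sees traffic that both client-side scripts and CDN caches miss.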
Bot categorization and the admin interface
In the RocketCite admin panel, detected bots appear in a categorized list: training bots grouped separately from search indexing bots and user-fetch bots. Each bot can be individually enabled or disabled. The registry displays each bot’s operator, category, and current detection status.
For publishers who want to allow search indexing while blocking training, or who want to permit some training bots while excluding others, this makes the configuration a matter of a few clicks rather than manual robots.txt work.
When bots don’t respect your robots.txt anymore
Bytespider, operated by ByteDance, is worth calling out specifically. Security research has documented it scraping at rates 25 times higher than GPTBot and up to 3,000 times higher than Anthropic’s crawlers. It has been observed ignoring robots.txt and rotating IP addresses to evade blocking. RocketCite includes Bytespider in its default registry and flags it as a case requiring server-level blocking rather than robots.txt rules alone, with guidance for common hosting and CDN configurations.
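On Apache hosts, a server-level user-agent block looks like the following .htaccess sketch. Since Bytespider has been observed rotating IPs and can present other user-agents, treat this as one layer; CDN-level or IP-based rules may still be needed on top of it.

```apache
# .htaccess: deny Bytespider by user-agent before WordPress loads
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
  RewriteRule .* - [F,L]
</IfModule>
```

The [F] flag returns 403 Forbidden at the web-server layer, so the request never consumes PHP or database resources.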
RocketCite is currently in development. Sign up below to be notified when the free plugin launches on WordPress.org.
Be the first to know when RocketCite Pro launches
Join the waitlist for exclusive early access and priority support. No spam, ever.


