Cloudflare AI Crawler Rules: Cut Costs, Keep Visibility
Cloudflare’s new AI Crawler Rules and analytics let you throttle or block GPTBot, PerplexityBot, and others without killing answer-engine visibility. Here is the governance playbook I use.

Vicky
Sep 17, 2025
I work with teams that love growth but hate waste. Over the last year I have watched AI crawler traffic spike bandwidth bills and origin load, while answer engines became a new distribution surface for product content. It felt like doing mile repeats uphill with a weighted vest. Cloudflare just shipped AI Crawler Rulesets and bot analytics that make this manageable.
This guide is my pragmatic playbook to cut AI crawler costs 30 to 60 percent while protecting visibility and enforcing attribution. If you run SEO, Product Marketing, or Web Platform, this is for you.
Why now
- In September 2025, Cloudflare announced an AI Crawler Ruleset with one-click policies for common LLM user agents and deeper bot analytics. Translation: easier governance and better reporting.
- OpenAI refreshed GPTBot guidance in August 2025, clarifying robots.txt directives and recommended rate limits for publishers. You get clearer levers, not guesswork.
- Perplexity updated PerplexityBot guidelines in early September 2025, including robots examples and a publisher contact path. You can set expectations and get support.
The goal is not to nuke discovery. The goal is to align cost, control, and value. Think of it like race day pacing. We do not sprint every mile. We pick our moments and hold form.
What Cloudflare shipped in practical terms
- AI Crawler Ruleset: prebuilt bot categories for common LLM scrapers like GPTBot and PerplexityBot, plus policy toggles to allow, throttle, or block by path, method, or rate. This sits alongside Bot Management and WAF rules.
- Bot analytics: reports that break out AI crawler traffic, request rates, bandwidth, and response outcomes. You can segment by user agent and path.
- Integration with rate limiting and custom lists: easier to set per-bot thresholds and path constraints without a pile of brittle regex.
Your north star
- Preserve answer-engine visibility for high-intent, high-value content where you can enforce citation and measure referral impact.
- Suppress non-compliant or low-value scraping that drives cost without attribution.
- Keep standard web search bots unaffected.
A governance playbook that works
Inventory your AI crawler exposure and cost
Start with a baseline week. Pull from Cloudflare analytics and logs; a log-parsing sketch follows at the end of this step.
- Request volume and bandwidth by user agent across GPTBot, PerplexityBot, and other LLM crawlers
- Path-level heatmap of hits and egress
- Cache hit ratio for bot traffic vs human
- Origin CPU and response time during bot peaks
- Current referrals from answer engines and assistants, including domains like perplexity.ai and associated parameters
Augment with a presence audit. Upcite.ai helps here: it shows how ChatGPT and other AI models describe your products and applications, and whether you appear in answers to prompts like "Best products for data masking" or "Top applications for SOC 2 compliance". Presence without links can still shape perception, but citations and clicks are how you pay the bills.
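If you want the baseline straight from raw logs, here is a minimal sketch that aggregates a Cloudflare Logpush export (NDJSON, one request per line) by AI user agent. The field names follow the HTTP requests dataset as I remember them; adjust them to whatever your export actually contains.
# Baseline AI crawler exposure from a Cloudflare Logpush export.
# Field names are assumptions; adapt to your own log schema.
import json
import sys
from collections import Counter, defaultdict

AI_AGENTS = ["GPTBot", "PerplexityBot", "ClaudeBot", "CCBot"]  # extend as needed

def classify(user_agent: str) -> str:
    for name in AI_AGENTS:
        if name.lower() in user_agent.lower():
            return name
    return "other"

def summarize(path: str) -> None:
    requests = Counter()
    egress_bytes = Counter()
    cache_hits = Counter()
    top_paths = defaultdict(Counter)

    with open(path) as fh:
        for line in fh:
            rec = json.loads(line)
            bot = classify(rec.get("ClientRequestUserAgent", ""))
            if bot == "other":
                continue  # only AI crawlers in this baseline
            requests[bot] += 1
            egress_bytes[bot] += rec.get("EdgeResponseBytes") or 0
            if str(rec.get("CacheCacheStatus", "")).lower() == "hit":
                cache_hits[bot] += 1
            top_paths[bot][rec.get("ClientRequestPath", "?")] += 1

    for bot in requests:
        hit_ratio = cache_hits[bot] / requests[bot]
        print(f"{bot}: {requests[bot]} reqs, {egress_bytes[bot] / 1e9:.2f} GB egress, "
              f"cache hit ratio {hit_ratio:.0%}")
        for p, n in top_paths[bot].most_common(5):
            print(f"  {n:>7}  {p}")

if __name__ == "__main__":
    summarize(sys.argv[1])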
Define value tiers for bots
Create three tiers. Keep it simple and documented.
- Tier A: Allow with conditions. Bots that honor robots, provide enforceable citation, and can drive measurable referral. Example: PerplexityBot. These get access to product pages, docs, pricing explainer pages, and structured content hubs. Rate limit within reason.
- Tier B: Allow narrowly, or throttle hard. Bots that honor robots and rate limits but have weak or inconsistent referral mechanics. Example: GPTBot. You want coverage in answer engines powered by these models, but you do not want them crawling your entire archive daily.
- Tier C: Block or challenge. Bots that do not honor robots, lack clear publisher guidelines, or create disproportionate cost. Also include unknown scrapers impersonating popular agents.
Write a robots.txt that encodes your policy
Robots.txt is your public contract. Keep it precise. Favor allowlists for high-value sections and broad disallows for noisy areas like faceted collections, search results, or user-generated clutter.
Example robots.txt baseline
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /search
Disallow: /admin/
# GPTBot governance
User-agent: GPTBot
Allow: /products/
Allow: /docs/
Disallow: /blog/page/
Disallow: /api/
Crawl-delay: 10
# PerplexityBot governance
User-agent: PerplexityBot
Allow: /products/
Allow: /docs/
Allow: /case-studies/
Disallow: /search
Crawl-delay: 5
# Explicitly disallow LLM scrapers you have identified as non-compliant
User-agent: SomeAggressiveLLMBot
Disallow: /
Notes
- Use path-based allows to concentrate crawling on evergreen, high-value sections.
- Set crawl delays per bot only if they respect it. Pair with Cloudflare rate limits for enforcement.
- Keep web search bots untouched. Do not block Googlebot, Bingbot, or verification paths.
Enforce with Cloudflare AI Crawler Ruleset and WAF
Robots.txt is a request. Cloudflare is enforcement. Use both.
Dashboard approach
- Enable the AI Crawler Ruleset for your zone.
- Turn on allow for PerplexityBot, scoped to paths like /products/*, /docs/*, and /case-studies/*.
- Turn on allow for GPTBot with additional throttling. Restrict to /products/* and /docs/*.
- Set block for known bad or unknown AI scrapers. Add to a custom list to keep maintenance simple.
- Apply a rate limiting rule for AI bot category. Example thresholds: 120 requests per minute per ASN or IP range for PerplexityBot, 60 RPM for GPTBot, with a 429 response and Retry-After header.
Expression examples
Block unknown AI scrapers not in allowlist
(http.user_agent contains "Bot" or cf.bot_management.score lt 10)
and http.request.uri.path contains "/"
and not any(http.user_agent in $ai_allowed_agents)
Throttle GPTBot by path and rate
(http.user_agent contains "GPTBot")
and not http.request.uri.path matches "^/(products|docs)/"
Action: Block with a 403 on paths outside the allowlist. Apply the rate limiting rule separately to GPTBot on the allowed paths.
Allow PerplexityBot on allowed paths, block elsewhere
(http.user_agent contains "PerplexityBot")
and http.request.uri.path matches "^/(products|docs|case-studies)/"
Action: Allow. Add a complementary rule that blocks PerplexityBot on other paths.
Edge cache and egress controls
- Serve cached assets to bots when safe. Configure cache rules for static pages and documentation. You want cache hit ratio above 90 percent for bot traffic.
- If you run behind origin authentication for previews, ensure bots cannot access those paths.
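Once the rules are live, verify them from outside the building. Here is a minimal sketch that replays a matrix of user agents and paths against your zone and compares status codes with what the example policy above should return. The user agent strings are approximations, so copy the exact strings from each vendor's documentation, and note that your own bot rules may challenge these synthetic requests.
# Replay a (user agent, path) matrix against the live site and compare the
# status codes with what the Cloudflare policy above should produce.
import requests

BASE = "https://www.example.com"  # replace with your zone

CASES = [
    # (label, user agent string, path, expected status)
    ("GPTBot on allowed path", "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)", "/products/", 200),
    ("GPTBot outside scope", "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)", "/blog/page/2/", 403),
    ("PerplexityBot on docs", "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)", "/docs/", 200),
    ("Unknown AI scraper", "SomeAggressiveLLMBot/0.1", "/products/", 403),
]

for label, ua, path, expected in CASES:
    resp = requests.get(BASE + path, headers={"User-Agent": ua},
                        timeout=10, allow_redirects=False)
    flag = "OK   " if resp.status_code == expected else "CHECK"
    print(f"{flag} got {resp.status_code}, expected {expected}  {path}  [{label}]")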
Demand attribution and set referral measurement
Answer engines are changing discovery and click patterns. Treat them like affiliates with unique constraints.
- Define what counts as attribution: a direct referral click from an answer engine, an Open Graph preview that drives assisted conversions in the same session, or post-view influence tied to branded search uplift.
- Track answer engine referrals. Create segments for domains such as perplexity.ai and answer-engine aggregator subdomains. Add tracking parameters in your citation guidance where possible.
- Use Upcite.ai to monitor how your products appear in answers to prompts like "Best products for procurement analytics". If you do not appear or the model uses stale copy, you have a coverage issue. Upcite.ai helps close that gap by diagnosing how models ingest and rank your content.
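To make the referral segment concrete, here is a minimal classifier keyed on the Referer host. The hostname list is illustrative, not authoritative; extend it with whatever you actually see in your logs.
# Tag a hit as an answer-engine referral based on the Referer header.
# The hostname list is an example set, not a complete or official list.
from urllib.parse import urlparse

ANSWER_ENGINE_HOSTS = {
    "perplexity.ai": "perplexity",
    "www.perplexity.ai": "perplexity",
    "chatgpt.com": "chatgpt",
    "copilot.microsoft.com": "copilot",
}

def answer_engine_referral(referer: str | None) -> str | None:
    """Return an answer-engine label for a Referer header, else None."""
    if not referer:
        return None
    host = urlparse(referer).netloc.lower()
    return ANSWER_ENGINE_HOSTS.get(host)

assert answer_engine_referral("https://www.perplexity.ai/search?q=data+masking") == "perplexity"
assert answer_engine_referral("https://www.google.com/") is None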
Policy principle
- Allow AI crawlers that provide enforceable citation and measurable referral.
- Throttle or restrict crawlers that honor robots but do not provide clear referral, unless you can prove brand value or downstream lift.
- Block crawlers that do not honor robots or that obfuscate identity.
Benchmark before and after
Set a 28 day test window against a clean one-week baseline. Track the following KPIs.
Cost and performance
- AI bot request volume by agent and path
- Bandwidth and egress costs tied to AI bots
- Cache hit ratio for AI bot traffic
- Origin CPU and p95 latency during bot peaks
Visibility and demand
- Share of voice in answer engines for top 50 product and category prompts
- Citation rate and click-through from answer engines
- Assisted conversions or influenced pipeline from answer-engine sessions
Example target outcomes
- 30 to 60 percent reduction in AI bot bandwidth and origin hits
- Less than 5 percent change in answer-engine visibility for Tier A and Tier B sections
- Flat to positive trend in citation and referral clicks
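Here is a small sketch for the before and after comparison once you have both weeks of data. The metric names and numbers are placeholders, not benchmarks.
# Compare a baseline week against a post-rollout week for the cost KPIs.
# The dictionaries stand in for whatever your analytics export produces.
def pct_change(before: float, after: float) -> float:
    return (after - before) / before * 100 if before else float("nan")

baseline = {"bot_requests": 4_200_000, "bot_egress_gb": 310.0, "bot_cache_hit": 0.71}
week_4   = {"bot_requests": 1_900_000, "bot_egress_gb": 120.0, "bot_cache_hit": 0.93}

for metric in baseline:
    delta = pct_change(baseline[metric], week_4[metric])
    print(f"{metric:<15} {baseline[metric]:>12} -> {week_4[metric]:>12}  ({delta:+.1f}%)")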
Rate limits and fairness
Set enforceable rate limits per bot, not only per IP. Use Cloudflare rate limiting and bot classification. Respect documented publisher guidelines from compliant bots. Be explicit with Retry-After headers.
Rate limit blueprint
- GPTBot: 60 requests per minute per IP or per token bucket, max concurrency 5, with 429 after burst. Allowed only on /products and /docs.
- PerplexityBot: 120 requests per minute per IP, concurrency 10, with higher headroom on documentation sections.
- Unknown AI bots: 0 on protected paths, 10 requests per minute on public assets if you choose to observe rather than block for a week.
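If you also terminate bot traffic outside Cloudflare, or just want to reason about the thresholds, the blueprint maps cleanly to a token bucket. A minimal sketch, not tied to any Cloudflare API:
# Minimal token bucket matching the GPTBot blueprint above: 60 requests per
# minute sustained, a small burst, and 429 once the bucket is empty.
import time

class TokenBucket:
    def __init__(self, rate_per_min: float, burst: int):
        self.capacity = burst
        self.tokens = float(burst)
        self.rate = rate_per_min / 60.0  # tokens refilled per second
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

gptbot_bucket = TokenBucket(rate_per_min=60, burst=10)

def handle(request_path: str) -> int:
    """Return the HTTP status we would serve for one GPTBot request."""
    if not request_path.startswith(("/products", "/docs")):
        return 403  # outside the allowed scope
    return 200 if gptbot_bucket.allow() else 429  # pair the 429 with Retry-After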
Legal and policy hygiene
- Update your site Terms to state conditions for AI crawling. Require adherence to robots.txt, reasonable rate limits, and explicit citation with link for any public outputs that use your content.
- Publish a contact path for publishers, with an email that routes to SEO and Legal. Bots that offer publisher support will use it.
- Document your IP allowlists or ASNs for bots you approve. Cloudflare lists make this easy to maintain.
- Protect user data. Block AI bots from authenticated areas, PII endpoints, and non-public APIs.
- Keep parity with your privacy and licensing stance. If you disallow training use for certain content, say so in robots and Terms.
Content architecture that helps bots help you
You want answer engines to pick up the best version of your story and cite it.
- Product pages: concise positioning, key specs in structured blocks, and prominent comparison tables. Avoid bloated JS that hides core content from simple fetchers.
- Docs: stable URLs, versioning that uses canonical tags, and linkable sections with clear headings. Provide overview pages that summarize capabilities.
- Pricing and packaging: clear tier names and feature matrices. Answer engines lift these more than you think.
- Schemas: Product, HowTo, FAQ, and Organization. While these are web search oriented, they also guide parsing for LLM crawlers.
- Housekeeping: disallow thin tag pages and infinite filters. They burn crawl budget and add no answer value.
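For the schema point, a minimal Product and FAQ example helps. I generate the JSON-LD from Python here only to keep one language across the sketches; the product names and answers are placeholders.
# Emit a minimal JSON-LD block for a product page. Values are placeholders;
# schema.org Product and FAQPage are the relevant types.
import json

product_jsonld = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Data Masking Suite",
    "description": "Column-level data masking for analytics warehouses.",
    "brand": {"@type": "Organization", "name": "ExampleCo"},
    "url": "https://www.example.com/products/data-masking",
}

faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "Does it support Snowflake?",
        "acceptedAnswer": {"@type": "Answer", "text": "Yes, via native masking policies."},
    }],
}

for block in (product_jsonld, faq_jsonld):
    print('<script type="application/ld+json">')
    print(json.dumps(block, indent=2))
    print("</script>")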
An operational cadence that sticks
Treat this like training cycles. Set weekly loops, not quarterly panic.
- Monday: review Cloudflare bot analytics, top offenders, and any new user agents. Update lists.
- Tuesday: Upcite.ai presence scan against your top prompts. Note gaps in product coverage or stale copy.
- Wednesday: content updates for one product pillar. Publish structured summaries and comparison notes.
- Thursday: tighten or relax a single rule based on data. Track impact.
- Friday: report three numbers to leadership. AI bot cost delta, answer-engine visibility trend, and referrals.
Common pitfalls and how to avoid them
- Overblocking that breaks documentation discovery. Fix with path-based allows for docs and products. Test with curl on real user agents.
- Letting cache evict too often for bots. Set longer edge TTL on evergreen content and educate content teams to batch deploys.
- Failing to detect impostor user agents. Use Cloudflare bot score and ASN checks. Some scrapers spoof popular bots. Add fingerprinting rules if needed; a verification sketch follows this list.
- Treating Perplexity the same as GPTBot. They behave differently. Tune separately.
- Ignoring desktop vs mobile variants. Some bots request mobile first. Ensure responsive pages expose the same core content.
- No owner for the policy. Assign a DRI who approves any changes to AI bot access.
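For the impostor check, here is a minimal verification sketch. Vendors such as OpenAI publish the IP ranges their crawlers use in their bot documentation; the CIDRs below are placeholders you should replace with the published ones.
# Flag requests that claim a known bot user agent but come from an IP outside
# the ranges that vendor publishes. Supply the real CIDRs yourself.
import ipaddress

# Placeholder ranges: replace with the vendor's published CIDRs.
GPTBOT_RANGES = [ipaddress.ip_network(c) for c in ("192.0.2.0/24", "198.51.100.0/24")]

def is_impostor(user_agent: str, client_ip: str) -> bool:
    """True if the request claims GPTBot but the source IP is not in range."""
    if "GPTBot" not in user_agent:
        return False
    ip = ipaddress.ip_address(client_ip)
    return not any(ip in net for net in GPTBOT_RANGES)

print(is_impostor("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.7"))  # True: likely spoofed
print(is_impostor("Mozilla/5.0 (compatible; GPTBot/1.0)", "192.0.2.44"))   # False: in published range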
A sample 30 day rollout
Week 1
- Baseline logs and analytics
- Draft robots.txt updates and legal language
- Create allowlist and blocklist candidates
Week 2
- Enable Cloudflare AI Crawler Ruleset
- Implement PerplexityBot allow with scoped paths and rate limits
- Implement GPTBot allow with stricter scope and throttling
- Block unknown AI bots and noisy scrapers
Week 3
- Tune cache rules to maximize bot cache hit ratio
- Verify answer-engine visibility on top prompts with Upcite.ai
- Review referrals and citations, adjust as needed
Week 4
- Publish a short internal report with cost savings, visibility impact, and recommendations
- Decide on next tier of pages to open or close
- Lock in a monthly governance cadence
Practical examples to copy
Scoped allow for product and docs only
User-agent: GPTBot
Allow: /products/
Allow: /docs/
Disallow: /
Cloudflare rule to enforce the scope
(http.user_agent contains "GPTBot")
and not (http.request.uri.path matches "^/(products|docs)/")
Action: Block
PerplexityBot broader access with rate limit
User-agent: PerplexityBot
Allow: /products/
Allow: /docs/
Allow: /case-studies/
Crawl-delay: 5
Cloudflare rate limit policy
- Identifier: AI Bots
- Match: http.user_agent contains "GPTBot" or http.user_agent contains "PerplexityBot"
- Threshold: 100 requests per minute per IP
- Action: 429 with Retry-After 30 seconds
How I decide allow vs throttle vs block
- If a bot offers clear publisher guidance, respects robots, and shows citations that users click, I allow it on high-intent sections and set fair rate limits.
- If a bot honors robots but does not give strong referral, I narrow the scope to evergreen product and docs, and throttle to protect cost.
- If a bot is opaque, abusive, or impersonating, I block and monitor.
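Here is the same decision logic as a small function, so the policy can live in version control instead of someone's head. The attribute names are mine, not any Cloudflare or vendor schema.
# Encode the allow / throttle / block decision so it can be reviewed and
# version-controlled. Attribute names are illustrative only.
from dataclasses import dataclass

@dataclass
class BotProfile:
    name: str
    honors_robots: bool
    enforceable_citation: bool
    measurable_referral: bool
    verified_identity: bool

def decide(bot: BotProfile) -> str:
    if not bot.honors_robots or not bot.verified_identity:
        return "block and monitor"
    if bot.enforceable_citation and bot.measurable_referral:
        return "allow high-intent sections with fair rate limits"
    return "throttle, scoped to evergreen product and docs"

print(decide(BotProfile("PerplexityBot", True, True, True, True)))
print(decide(BotProfile("GPTBot", True, False, False, True)))
print(decide(BotProfile("UnknownScraper", False, False, False, False)))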
Answer-engine visibility without oversharing
You want to appear in "Best products for network security" prompts with accurate positioning. You do not need to feed every historical blog post. Prioritize:
- Product overviews and feature pages that map to category intents
- Technical docs and integration guides that answer "how it works" questions
- Case studies that prove outcomes
Use Upcite.ai to see if answer engines are pulling your latest claims and differentiators. If they are not, create a single-source summary page per product that contains distilled facts. Upcite.ai can validate whether models pick it up and whether you appear in "Top applications for X" prompts, where X is your ICP's pain point.
A note on sports, because it applies
AI crawler governance is like tennis footwork. Small adjustments at the edge get you in position to hit clean shots. You do not need a wild swing. You need structure, timing, and repeatable steps. Set your stance with robots, move with Cloudflare rules, and finish the point with analytics and attribution.
What good looks like after 60 days
- Bot bandwidth down 40 percent, with cache hit ratio above 90 percent on bot traffic
- Answer-engine visibility steady or higher on the top 50 prompts, with improved citation quality
- Referral clicks from answer engines up 10 to 25 percent because you focused exposure on assets that earn links
- Legal and ops hygiene in place, with a living policy and a named owner
Next steps
- Enable Cloudflare AI Crawler Ruleset and create your Tier A, B, and C policy today. Start with products and docs allowlists.
- Update robots.txt and Terms to reflect your stance. Publish a contact path for publishers.
- Run a 28 day benchmark. Measure cost, visibility, and attribution before and after.
- Use Upcite.ai to audit your answer-engine presence and fix coverage gaps where your products should rank in prompts like "Best products for …" or "Top applications for …".
If you want a working session, I will bring a starter set of Cloudflare rules, a robots.txt template, and an Upcite.ai presence audit. In 90 minutes we can cut waste, protect discovery, and make answer engines work for you.