Cloudflare introduces AI Index: a new web “feed” for agents and LLMs with control, monetization, and real-time publishing

Cloudflare has announced the private beta of AI Index, a next-generation web index for domains that promises to make content discoverable to AI under publisher control, while giving AI builders access to a structured, real-time, and fairly compensated feed. The concept is as simple as it is ambitious: if the web of the future is consulted by agents and models, sites should be able to decide how their content is accessed, with clear rules and monetization options; and AI teams should be able to subscribe to changes at the origin without wasting resources on indiscriminate crawling.

With AI Index enabled, Cloudflare automatically creates an AI-optimized index of the domain (owned by the customer), exposes ready-to-use standard APIs (an MCP server, LLMs.txt, LLMs-full.txt, a Search API, a Bulk Data API, and a pub/sub channel), and integrates it with AI Crawl Control to see who is accessing content, set permissions, establish policies, and, if desired, charge for access through Pay per crawl and the new x402 integrations. On top of individual indexes, the company will build an aggregated layer, Open Index, that groups participating sites for broader searches without sites losing control or compensation.


Why now: from “crawled” web to “subscribed” web

Chatbots, agents, and generative search experiences have become primary ways of discovering information. The problem: the current flow largely depends on blind crawling, with inconsistent policies and little control for creators. Publishers lack an efficient way to signal changes to AI providers, and for teams training or serving models, recrawling unstructured content costs time and money, with no visibility into quality or price beforehand.

Cloudflare proposes a paradigm shift: from indiscriminate crawling to permissioned pub/sub. Sites that want to participate opt in, expose a structured index, publish update events when content changes, and define rules and prices. Instead of scraping the entire web, AI builders discover domains with active indexes, evaluate metadata (e.g., uniqueness, depth, relevance, popularity), pay for access when necessary, and subscribe to changes for fresh data, avoiding frequent recrawls.
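As a rough mental model of that flow (not Cloudflare's published schema; every name below is an assumption for illustration), the moving parts might be typed like this:

```typescript
// Hypothetical types sketching the pub/sub model described above.
// None of these names come from Cloudflare's API; they only illustrate the flow.

interface IndexedDomain {
  domain: string;
  // Metadata a builder could evaluate before paying for access.
  metadata: {
    uniqueness: number;   // 0..1
    depth: number;        // e.g. number of indexed pages
    relevance: string[];  // topical tags
    popularity: number;   // relative traffic signal
  };
  pricing?: { model: "pay-per-crawl"; currency: string; pricePerRequest: number };
}

interface ChangeEvent {
  domain: string;
  url: string;
  kind: "created" | "updated" | "deleted";
  publishedAt: string; // ISO 8601 timestamp
}

// A builder's loop: discover -> evaluate -> (pay) -> subscribe, instead of blind crawling.
type OnChange = (event: ChangeEvent) => Promise<void>;
```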


What AI Index provides when activated on a domain

When a Cloudflare customer onboards a domain or enables the feature on an existing one, the platform builds and maintains an AI-optimized index for that site. The process uses the same technological base as Cloudflare AI Search (formerly AutoRAG), with the website itself as the data source connector:

  • Real-time processing of new or updated pages, with automatic handling of storage, embeddings, chunking, models, and compute.
  • Granular inclusion/exclusion control: the publisher decides what goes in, what stays out, who accesses it, and how. The index can be disabled entirely at any time.
  • Standard APIs for immediate consumption:
    • MCP Server (Model Context Protocol): agents can connect directly in a standardized way. It supports NLWeb tools (an open protocol from Microsoft for natural-language queries over sites).
    • Search API: flexible queries with relevance-ranked, structured JSON results (a hedged client sketch follows this list).
    • LLMs.txt and LLMs-full.txt: standard files that give models a machine-readable map of the site at inference time (Cloudflare already shares an example in its docs).
    • Bulk Data API: bulk content ingestion under the publisher's rules, reducing the need for repeated document-by-document reads.
    • Pub/Sub: event subscriptions and payloads for real-time changes, so providers stay up to date without constant recrawls.
    • Discoverability directives: entries in robots.txt and .well-known that let recognized agents and crawlers discover and use the APIs automatically.
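A minimal sketch of how a client might discover and query such an index. The `.well-known` path, endpoint name, and response shape are assumptions for illustration, not Cloudflare's documented API:

```typescript
// Hypothetical discovery + search flow; paths and fields are illustrative only.
async function searchDomainIndex(domain: string, query: string) {
  // 1. Discover advertised endpoints (assumed .well-known location).
  const wellKnown = await fetch(`https://${domain}/.well-known/ai-index.json`);
  if (!wellKnown.ok) throw new Error(`${domain} does not expose an AI index`);
  const { searchEndpoint } = await wellKnown.json() as { searchEndpoint: string };

  // 2. Query the Search API and read relevance-ranked JSON results.
  const res = await fetch(`${searchEndpoint}?q=${encodeURIComponent(query)}`, {
    headers: { Accept: "application/json" },
  });
  return res.json(); // e.g. { results: [{ url, title, snippet, score }, ...] }
}

// Usage (illustrative):
// const hits = await searchDomainIndex("example.com", "zero trust onboarding");
```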

The index is integrated with AI Crawl Control, providing visibility into access, policies, and permissions; it also complements Pay per crawl and x402 for direct monetization. The site owner has full control over who, how, and how much.


For AI developers: a permissioned web feed

Creators of agents or AI platforms will be able to discover and subscribe to high-quality web data with explicit permission via individual indexes:

  1. Discover sites that have opted to expose their indexes (navigable directory with filters).
  2. Evaluate content before access (metadata: uniqueness, depth, relevance, popularity); a selection sketch follows this list.
  3. Pay a fair price for access (Pay per crawl) when appropriate, with revenue flowing to the site.
  4. Subscribe to changes, receiving real-time events and avoiding recrawls.
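Purely as an illustration of steps 1 and 2 (the directory endpoint, field names, and thresholds below are invented, not Cloudflare's published interface), a builder's source-selection step might look like this:

```typescript
// Hypothetical directory lookup and filtering; endpoint and fields are illustrative only.
interface DirectoryEntry {
  domain: string;
  uniqueness: number;
  depth: number;
  relevance: string[];
  popularity: number;
  pricePerRequest?: number; // present when Pay per crawl is enabled
}

async function pickSources(topic: string, budgetPerRequest: number): Promise<DirectoryEntry[]> {
  // Assumed directory endpoint listing sites that opted in to expose their index.
  const res = await fetch(`https://directory.example/ai-index?topic=${encodeURIComponent(topic)}`);
  const entries: DirectoryEntry[] = await res.json();

  // Evaluate metadata before paying: keep unique, deep sources within budget.
  return entries.filter(e =>
    e.uniqueness > 0.7 &&
    e.depth > 100 &&
    (e.pricePerRequest ?? 0) <= budgetPerRequest
  );
}
```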

This approach reduces costs (less crawling, less duplication), shortens turnaround (only changes are processed), improves quality (structured data), and respects publisher intent. Access is always at the discretion of the domain owner.


Open Index: unified search at scale (with control and revenue flowing back to sites)

Managing dozens or hundreds of per-site subscriptions can become complex when searching broadly. To address this, Cloudflare will launch Open Index, an aggregated, opt-in collection of individual indexes accessible from a single portal:

  • Unified access: query and retrieve data from many participating sites at once; ideal as a ready-to-use web search layer and as a curated collection (a query sketch follows this list).
  • Thematic scopes: bundles for news, documentation, scientific research, and so on, or a general index for broad exploration.
  • Monetization that flows back: results come from individual site indexes, with compensation returned via Pay per crawl.
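As an illustration only of the "single portal" idea (the endpoint, scope parameter, and response fields are assumptions, not a documented interface), a scoped Open Index query might look like:

```typescript
// Hypothetical aggregated query across participating sites; names are illustrative only.
async function queryOpenIndex(scope: "news" | "docs" | "research" | "general", q: string) {
  const res = await fetch(
    `https://open-index.example/v1/search?scope=${scope}&q=${encodeURIComponent(q)}`,
    { headers: { Accept: "application/json" } },
  );
  // Each hit is attributed to its origin site so compensation can flow back to it.
  return res.json() as Promise<{ results: { domain: string; url: string; snippet: string }[] }>;
}
```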

Thus, builders can choose: precision and full text with per-site indexes (for training, agents, first-party search experiences), or broad coverage with Open Index when scale and quick discovery are needed.


What each actor gains

Content creators and publishers

  • Total control: decide what to expose, to whom, under what conditions, and how to audit access.
  • Visibility: a direct pipeline for agents and LLMs to discover and use their content in a standardized way.
  • Revenue: Pay per crawl/x402 to monetize access without opaque agreements.

AI builders (teams, platforms, integrators)

  • Quality and freshness: pub/sub subscriptions to structured changes, fewer false positives and recrawls.
  • Efficiency: lower cost per query, with predictable quality and price per source.
  • Compliance: a direct relationship with the site owner, with explicit permissions and traceability.

The ecosystem

  • From “crawl everything and see what sticks” to connecting with participating sources; from presumed use to permissioned and compensated access. A healthier framework for the generative web.

How it will work in practice (flow overview)

  1. Onboarding: the domain owner activates AI Index via Cloudflare.
  2. Index construction: the system processes the site (using AI Search technology), creates embeddings, exposes the APIs (MCP, Search, Bulk Data, LLMs.txt, pub/sub), and applies AI Crawl Control.
  3. Rules and monetization: the publisher sets inclusion/exclusion rules, permissions, pricing, and x402.
  4. Discovery: builders find the domain in the directory, review its metadata, and subscribe (or query).
  5. Updates: the site emits real-time events; the provider consumes bulk data or runs a query, pays for access where applicable, and the access is recorded for traceability (a consumer sketch follows this list).
  6. Aggregation: the publisher can choose to enable Open Index for greater discoverability while retaining control and compensation.
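To make step 5 concrete, here is a hedged sketch of an event consumer. The webhook payload shape and the bulk-fetch endpoint are assumptions for illustration, not Cloudflare's documented contract:

```typescript
import { createServer } from "node:http";

// Hypothetical webhook receiver for index change events; payload and endpoints are illustrative.
interface ChangeEvent {
  domain: string;
  url: string;
  kind: "created" | "updated" | "deleted";
}

createServer(async (req, res) => {
  if (req.method !== "POST" || req.url !== "/ai-index/events") {
    res.writeHead(404).end();
    return;
  }
  let body = "";
  for await (const chunk of req) body += chunk;
  const event: ChangeEvent = JSON.parse(body);

  if (event.kind !== "deleted") {
    // Re-fetch only the changed document instead of recrawling the site (assumed endpoint).
    const doc = await fetch(`https://${event.domain}/ai-index/bulk?url=${encodeURIComponent(event.url)}`);
    console.log("refreshed", event.url, doc.status);
  }
  res.writeHead(204).end();
}).listen(8080);
```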

Key questions and answers

Is activating AI Index mandatory when using Cloudflare?
No. It’s opt-in. The publisher chooses to enable it, decides what content to index and who can access it, and can also disable the feature entirely.

What standards are supported for agents and LLMs?
It includes MCP (Model Context Protocol) for direct agent connections, support for NLWeb tools (an open standard for natural-language queries), LLMs.txt and LLMs-full.txt files that give models a machine-readable map of the site at inference time, and discoverability directives in robots.txt and .well-known.

How is monetization and access traceability managed?
Through Pay per crawl and x402, which enable pay-per-access. AI Crawl Control provides auditing, policies, and permissions. Payments flow back to the origin site, even when content is accessed via Open Index.
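The exact handshake is up to Cloudflare; as a hedged sketch assuming an HTTP 402-style flow (the header names below are placeholders, not a documented contract), a paying client might look like this:

```typescript
// Hypothetical pay-per-access flow built around HTTP 402; header names are placeholders.
async function fetchWithPayment(url: string, payFn: (quote: string) => Promise<string>) {
  let res = await fetch(url);
  if (res.status === 402) {
    // The response is assumed to carry a price/terms quote for this resource.
    const quote = res.headers.get("x-payment-required") ?? "";
    // payFn settles the quote (e.g. via an x402-compatible wallet) and returns a proof token.
    const proof = await payFn(quote);
    res = await fetch(url, { headers: { "x-payment": proof } });
  }
  return res;
}
```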

What is the advantage of the pub/sub model over traditional crawling?
It reduces costs and latency: providers receive structured events when content changes, avoiding periodic recrawls that waste compute and stress servers. It also allows quality metrics (uniqueness, depth, relevance) to be evaluated before purchasing access.

Can a site demand usage rules or withdraw its content?
Yes. The publisher controls policies (what, whom, how, how much) and can opt out entirely. Access always remains at the domain owner’s discretion.


Future steps and how to participate

Cloudflare is starting with a private beta. Publishers interested in activating AI Index and builders seeking to consume the feed (per-domain indexes or Open Index) can register today to be considered. The vision: a web where sites decide how their content fuels AI, and where agents receive reliable, structured, permissioned data at scale.

Context: Cloudflare positions AI Index within its connectivity cloud, a platform that protects corporate networks, accelerates apps at Internet scale, mitigates DDoS attacks, blocks intrusions, and supports the shift to Zero Trust. With AI Index and Open Index, the company aims for a more equitable ecosystem among creators, models, and agents.


Frequently Asked Questions

What exactly is LLMs.txt, and how does it differ from robots.txt?
LLMs.txt (and LLMs-full.txt) are machine-readable files describing how a site’s content should be consumed by an LLM at inference time (e.g., relevant routes, formats, limits). robots.txt guides crawling, while LLMs.txt guides model consumption.
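As a rough illustration (following the publicly proposed llms.txt convention; the site, paths, and descriptions are invented), such a file might look like:

```
# Example Site
> Product documentation and engineering blog for Example, Inc.

## Docs
- [Getting started](https://example.com/docs/start): installation and first steps
- [API reference](https://example.com/docs/api): endpoints and authentication

## Optional
- [Blog archive](https://example.com/blog): long-form engineering posts
```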

Can I use the index to improve my website’s internal search?
Yes. The domain’s index — owned by the publisher — can be used for modern search experiences within the site, as well as exposing APIs for external agents.

How will AI providers know my site offers AI Index?
Via discoverability directives in robots.txt and .well-known routes, and via the directory of sites that opt to publish their index. MCP agents can also discover the endpoint automatically.

What if I change my mind about monetization or access?
Policies are dynamic. The publisher can adjust rules and prices, revoke permissions, or exit the program. Control always remains with the domain owner.

How are privacy and regulatory compliance protected?
The publisher decides what content to index or exclude. Access is managed with policies, permissions, and auditing (AI Crawl Control). For sensitive or regulated data, the recommendation is to exclude it or to set strict access conditions.

via: blog.cloudflare
