
How to Make Your Payload CMS Site AI-Ready Right Now

Artsiom Kaplich · 13 min read

Unlike Googlebot, AI crawlers hit your site with heavy traffic, look for raw Markdown, and scan for custom metadata. If you're building on Next.js and Payload, implementing these optimizations takes minimal overhead - but leaving them misconfigured makes your content invisible to LLMs.

The shift is already measurable. Ahrefs analyzed 300,000 keywords and found that when AI Overviews appear, the top organic result loses 34.5% of its clicks. Their February 2026 follow-up put that figure at 58%. Vercel went from less than 1% to 10% of new signups arriving from ChatGPT in six months. The traffic mix is moving - quietly, but fast.

TL;DR

  • Access first. AI crawlers are blocked by Cloudflare's Bot Fight Mode by default. Unblock search bots at the CDN level, then use robots.ts to explicitly allow search indexers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) while blocking training scrapers (GPTBot, ClaudeBot, Google-Extended). Nothing else here matters until requests reach your server.
  • Serve Markdown via content negotiation. Add a middleware check for Accept: text/markdown and return raw Markdown from a route handler. This cuts token overhead by ~80% versus HTML and makes your content structurally preferred by LLM pipelines — high signal, low effort.
  • Link your structured data into one @graph. Three Schema types — BreadcrumbList for hierarchy, FAQPage for citable facts, and Article linked to Author and Organization — are what AI search bots actually use to assess relevance and authority. Without structured authorship, models have no signal to trust your content over anyone else's.
  • Skip llms.txt unless you run a documentation site. Adoption is ~2% outside developer tooling, Google ignores it entirely, and AI search crawlers don't fetch it unprompted. Spend that time on the Markdown endpoint instead.

How Do You Let AI Crawlers Reach Your Site?

Cloudflare's Bot Fight Mode blocks AI bots by default. You have to explicitly unblock them in Cloudflare, then split search bots from training scrapers in your robots.ts.

Before you optimize anything, confirm the crawlers can reach your server.

If you use Cloudflare, Bot Fight Mode is usually on by default and blocks GPTBot and PerplexityBot before requests reach your app. Choose the option for your plan:

  1. Paid plans - flip the AI-bot toggles in Security → Bots.
  2. Free plan - add a WAF Custom Rule with action Skip matching the user agents you want through.

Same goal either way: the request has to reach your origin before anything downstream matters.

On the Next.js side, you handle this in app/robots.ts. A wildcard rule lets everyone in. The smarter move is to welcome AI search while blocking AI training - surface in real-time answers on ChatGPT and Claude, without silently feeding the next round of LLM training.

Each major vendor now ships separate user agents for those two jobs. Split them cleanly.

| User-Agent | Purpose | Action |
|---|---|---|
| OAI-SearchBot | Search indexer | Allow |
| ChatGPT-User | Live in-chat fetches | Allow |
| Claude-SearchBot | Search indexer | Allow |
| Claude-User | Live in-chat fetches | Allow |
| PerplexityBot | Conversational search crawler | Allow |
| GPTBot | Training scraper | Block |
| ClaudeBot | Training scraper | Block |
| Google-Extended | Gemini training corpus | Block |
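The table above can be expressed as a robots config. This is a minimal sketch, not a drop-in file: the local RobotsRule and RobotsConfig types are stand-ins that mirror the shape of MetadataRoute.Robots from next so the snippet runs on its own, and the wildcard rule for all other agents is an assumption about what you want for regular crawlers.

```typescript
// Sketch of app/robots.ts. The local types below are assumptions that stand in
// for MetadataRoute.Robots from "next"; in a real project, import that type.
type RobotsRule = {
  userAgent: string | string[];
  allow?: string | string[];
  disallow?: string | string[];
};
type RobotsConfig = { rules: RobotsRule[]; sitemap?: string };

// Search indexers and live in-chat fetchers: let them in.
const AI_SEARCH_BOTS = [
  "OAI-SearchBot",
  "ChatGPT-User",
  "Claude-SearchBot",
  "Claude-User",
  "PerplexityBot",
];

// Training scrapers: keep them out.
const AI_TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"];

function robots(): RobotsConfig {
  return {
    rules: [
      // Everyone else (regular search engines, browsers) stays allowed.
      { userAgent: "*", allow: "/" },
      { userAgent: AI_SEARCH_BOTS, allow: "/" },
      { userAgent: AI_TRAINING_BOTS, disallow: "/" },
    ],
  };
}

// In app/robots.ts you would finish with: export default robots;
```

Keeping the two lists as named constants makes the split easy to audit when a vendor adds or renames a user agent.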


Deploy, then check your access logs three to five days later. If you don't see successful requests from OAI-SearchBot or PerplexityBot, something upstream - like a WAF, CDN, or origin firewall - is still blocking them. Audit your network configuration to find where the requests are being dropped.

How Do You Expose a Markdown Endpoint for Every Page?

Use content negotiation to serve Markdown. Return raw Markdown when bots hit your URLs with an Accept: text/markdown header.

Sure, AI crawlers can parse HTML. But they process plain Markdown way more efficiently - it means fewer tokens, zero layout noise, and a much cleaner structure. Cloudflare reported roughly 80% fewer tokens on one of their own posts when serving Markdown vs. HTML. That's why LLM pipelines prefer it when it's available. And why a .md version of every public page is the highest-signal surface you can offer them.

The implementation has two parts:

  1. A middleware rewrite (src/proxy.ts) - checks for the Accept: text/markdown header and routes matching requests to an internal /md/* path. This keeps URLs clean and uniform for both users and crawlers, with no need for .md extensions or extra canonicals.
  2. A route handler (app/md/posts/[slug]/route.ts) - fetches the document from Payload and returns it as plain Markdown.
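The first part can be sketched as a small, framework-free decision function. This is an illustration under assumptions: markdownRewriteTarget is a hypothetical helper name, and the NextResponse wiring is shown only as a comment because it depends on your Next.js version and matcher config.

```typescript
// Core of the middleware in src/proxy.ts, kept free of framework imports.
// Returns the internal /md/* path to rewrite to, or null to serve HTML as usual.
function markdownRewriteTarget(
  pathname: string,
  accept: string | null,
): string | null {
  if (!accept) return null;

  // Accept may list several media types with parameters,
  // e.g. "text/html, text/markdown;q=0.9".
  const wantsMarkdown = accept
    .split(",")
    .some((part) => (part.split(";")[0] ?? "").trim().toLowerCase() === "text/markdown");
  if (!wantsMarkdown) return null;

  // Do not rewrite requests that already target the internal route.
  if (pathname.startsWith("/md/")) return null;

  return `/md${pathname}`;
}

// Wiring sketch (assumes Next.js; not executed here):
//   const target = markdownRewriteTarget(req.nextUrl.pathname, req.headers.get("accept"));
//   return target
//     ? NextResponse.rewrite(new URL(target, req.url))
//     : NextResponse.next();
```

Because the rewrite is internal, the public URL stays the same for browsers and agents; only the Accept header decides which representation is served.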

buildMarkdown assembles the response: an H1 title, the body, and whatever fields you need (publication date, excerpt, author). What goes inside the body depends on how you store content.

If your collection has a contentMarkdown plain-text field, use it directly. If content lives in Lexical JSON - the default in Payload 3.x - use the built-in convertLexicalToMarkdown from @payloadcms/richtext-lexical.
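One way buildMarkdown could look, as a rough sketch: the PostForMarkdown shape and its field names are assumptions for illustration, and body is expected to already be Markdown (either the contentMarkdown field or the converted Lexical output).

```typescript
// Hypothetical post shape; adjust field names to your own collection.
type PostForMarkdown = {
  title: string;
  body: string; // already Markdown
  publishedAt?: string; // ISO 8601 date string
  excerpt?: string;
  author?: string;
};

function buildMarkdown(post: PostForMarkdown): string {
  const meta: string[] = [];
  if (post.author) meta.push(`Author: ${post.author}`);
  // Keep only the date part (YYYY-MM-DD) of the ISO timestamp.
  if (post.publishedAt) meta.push(`Published: ${post.publishedAt.slice(0, 10)}`);

  const sections: string[] = [`# ${post.title}`];
  if (meta.length > 0) sections.push(meta.join(" | "));
  if (post.excerpt) sections.push(`> ${post.excerpt}`);
  sections.push(post.body.trim());

  // Blank line between sections, single trailing newline.
  return sections.join("\n\n") + "\n";
}
```

Optional fields are simply skipped when absent, so the same function works for posts with and without an excerpt or author.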

For complex content models with custom blocks, don't serialize on every request. Add a beforeChange hook that converts Lexical to Markdown at save time and stores it alongside the source. Serialization is a publish-time operation, not a read-time one.
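A save-time hook along these lines could do the conversion. Treat it as a sketch under assumptions: the local types stand in for Payload's CollectionBeforeChangeHook, makeMarkdownHook is a hypothetical helper, and the serializer is injected so the example stays independent of the Lexical converter's exact signature.

```typescript
// Minimal stand-ins for Payload's hook types so the sketch is self-contained.
// In a real project, type the hook as CollectionBeforeChangeHook from "payload".
type PostData = {
  content?: unknown; // Lexical editor state
  contentMarkdown?: string; // stored Markdown copy (assumed field name)
  [key: string]: unknown;
};
type BeforeChangeArgs = { data: PostData };

// `serialize` is injected; in a Payload project it would wrap the
// Lexical-to-Markdown converter.
function makeMarkdownHook(serialize: (lexical: unknown) => string) {
  return ({ data }: BeforeChangeArgs): PostData => {
    // Partial updates may not include the rich-text field; leave them alone.
    if (data.content === undefined) return data;
    return { ...data, contentMarkdown: serialize(data.content) };
  };
}
```

The route handler can then read contentMarkdown directly, which keeps the read path free of serialization work.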

The Vary: Accept header on the Markdown response matters. It tells caches and CDNs to store HTML and Markdown variants separately, so a browser never gets the Markdown body and an agent never gets the HTML page.
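A minimal sketch of the response itself, using the standard Response object that Next.js route handlers return. The helper name markdownResponse is an assumption; the headers are the part that matters.

```typescript
// Builds the Markdown response for the /md/* route handler.
function markdownResponse(markdown: string): Response {
  return new Response(markdown, {
    status: 200,
    headers: {
      "Content-Type": "text/markdown; charset=utf-8",
      // Tell caches the representation depends on the request's Accept header.
      Vary: "Accept",
    },
  });
}

// Inside app/md/posts/[slug]/route.ts, a GET handler would end with
// something like: return markdownResponse(buildMarkdown(post));
```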

Which Structured Data Do AI Search Bots Actually Read?

Three JSON-LD types - BreadcrumbList, FAQPage, and Article linked to Author/Organization - cover the structural, citation, and authority signals AI search bots actually use. Generate them under one @graph per page.

AI search bots prefer content they can parse cleanly. Three specific Schema types handle most of this heavy lifting, and all of them map perfectly onto Payload CMS collections.

| Schema | What it signals | Source in Payload |
|---|---|---|
| BreadcrumbList | Hierarchy | plugin-nested-docs breadcrumbs |
| FAQPage | Citation units | faqs collection or array field |
| Article + Author + Organization | Authority | posts collection + globals |

You can drop these into one @graph block or split them across separate JSON-LD scripts - parsers handle both the same way.

Editorial tooling that feeds the same signals

The structured data above is only as good as the content it describes. Two plugins close that loop inside the Payload admin. @payloadcms/plugin-seo exposes auto-generate hooks for title, description, and image — you wire them to your own functions, so editors can regenerate meta fields from the document's actual content rather than filling them in by hand.
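As an illustration of "your own functions", here is a sketch of two generators. The SeoDoc shape, the SITE_NAME constant, and the 155-character cutoff are all assumptions for the example, and the plugin wiring appears only as a comment.

```typescript
// Hypothetical document shape; use your collection's real fields.
type SeoDoc = { title?: string; excerpt?: string };

const SITE_NAME = "Example Site"; // assumption for illustration
const MAX_DESCRIPTION_LENGTH = 155; // assumed target length, not a hard rule

function generateTitle({ doc }: { doc: SeoDoc }): string {
  return doc.title ? `${doc.title} | ${SITE_NAME}` : SITE_NAME;
}

function generateDescription({ doc }: { doc: SeoDoc }): string {
  // Collapse whitespace so line breaks in the excerpt don't leak into meta tags.
  const text = (doc.excerpt ?? "").replace(/\s+/g, " ").trim();
  if (text.length <= MAX_DESCRIPTION_LENGTH) return text;

  // Cut at the last word boundary before the limit.
  const cut = text.slice(0, MAX_DESCRIPTION_LENGTH);
  const lastSpace = cut.lastIndexOf(" ");
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + "...";
}

// Wiring sketch (not executed here): pass both functions to the SEO plugin,
// e.g. seoPlugin({ generateTitle, generateDescription }).
```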

For the content itself, payload-ai adds AI-assisted generation directly to Text and RichText fields. The practical value for SEO isn't speed, it's consistency: content drafted with your FAQ schema, author fields, and sameAs links already populated gives the structured data something accurate to point at.

Hierarchy from plugin-nested-docs

@payloadcms/plugin-nested-docs adds two fields to every document in the collections you enable: parent is a self-referencing relationship (editors pick the parent doc), and breadcrumbs is an auto-populated array of ancestors, each entry with a label and a URL.

Hierarchy is one of the strongest signals AI crawlers use. A page at /docs/getting-started/installation already tells a model what it's about and where it fits - before reading a single line of body text. BreadcrumbList JSON-LD turns that path into something machines can read, and plugin-nested-docs builds it for you: every document already knows its full chain of parent pages, with labels and URLs, ready to drop into the schema. No URL logic to write by hand, and nothing breaks when an editor renames a parent page.
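Mapping the breadcrumbs array to BreadcrumbList could look like the sketch below. It assumes each entry carries a label and a site-relative url as described above; breadcrumbListJsonLd and the siteUrl parameter are hypothetical names.

```typescript
// Shape of one breadcrumbs entry (label and url may be empty on drafts).
type Breadcrumb = { label?: string | null; url?: string | null };

function breadcrumbListJsonLd(breadcrumbs: Breadcrumb[], siteUrl: string) {
  const itemListElement = breadcrumbs
    // Skip incomplete entries instead of emitting invalid list items.
    .filter((c): c is { label: string; url: string } => Boolean(c.label && c.url))
    .map((crumb, index) => ({
      "@type": "ListItem",
      position: index + 1, // schema.org positions are 1-based
      name: crumb.label,
      // Resolve the site-relative path against the site origin.
      item: new URL(crumb.url, siteUrl).toString(),
    }));

  return {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    itemListElement,
  };
}
```

Render the result server-side with JSON.stringify inside a script tag of type application/ld+json.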

FAQ and Author/Organization

An FAQPage schema packages questions and answers into a single machine-readable unit that AI bots can lift verbatim - attribution included. Think of it as an inverted pyramid: keep the acceptedAnswer to one or two tight sentences, and leave the deep details in the page body for human readers. Long, vague answers don't get cited; concise, specific ones do. In Payload, a dedicated faqs collection or a simple array field gives editors a clean, structured way to manage this.
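A sketch of the FAQPage mapping, assuming each FAQ entry has a question and a short plain-text answer (the Faq type and field names are illustrative):

```typescript
// Hypothetical FAQ entry as stored in a faqs collection or array field.
type Faq = { question: string; answer: string };

function faqPageJsonLd(faqs: Faq[]) {
  return {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    mainEntity: faqs.map((faq) => ({
      "@type": "Question",
      name: faq.question,
      // Keep the answer to one or two sentences; details belong in the page body.
      acceptedAnswer: { "@type": "Answer", text: faq.answer },
    })),
  };
}
```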

Article linked to Author and Organization is how machines read authority. A bio written in the page text does nothing for them. The links have to be set in the markup. The Author is a Person with a sameAs array pointing to verified profiles (LinkedIn, GitHub, ORCID); the publisher is an Organization global edited once with its own sameAs and logo. Without structured authorship, you lose the authority signal - leaving the LLM to just guess whether your content is trustworthy.
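The linking works through @id references inside one @graph. A minimal sketch, assuming hypothetical Post, Author, and Org shapes and an @id scheme based on URL fragments (any stable, unique IRI would do):

```typescript
type JsonLdNode = Record<string, unknown>;
type Author = { name: string; url: string; sameAs: string[] };
type Org = { name: string; url: string; logo: string; sameAs: string[] };
type Post = { title: string; url: string; publishedAt: string };

function articleGraph(
  post: Post,
  author: Author,
  org: Org,
): { "@context": string; "@graph": JsonLdNode[] } {
  // Stable identifiers let the Article point at the other nodes by reference.
  const orgId = `${org.url}#organization`;
  const authorId = `${author.url}#person`;

  return {
    "@context": "https://schema.org",
    "@graph": [
      { "@type": "Organization", "@id": orgId, name: org.name, url: org.url, logo: org.logo, sameAs: org.sameAs },
      { "@type": "Person", "@id": authorId, name: author.name, url: author.url, sameAs: author.sameAs },
      {
        "@type": "Article",
        "@id": `${post.url}#article`,
        headline: post.title,
        datePublished: post.publishedAt,
        mainEntityOfPage: post.url,
        author: { "@id": authorId },
        publisher: { "@id": orgId },
      },
    ],
  };
}
```

Since the Organization comes from a single global, the same node can be injected unchanged into every page's graph.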

Two key rules here. First, render everything server-side: your schema must match the DOM. If you hide your FAQ behind a JS accordion that only mounts on click, your JSON-LD will point to phantom markup. Second, keep Organization as a Payload global - edit it once, inject it everywhere, and prevent data drift when your logo or legal name changes. Validate the final output with Google's Rich Results Test before shipping.

When Does llms.txt Actually Matter?

Mostly worth it for developer documentation. AI agents only fetch llms.txt when explicitly pointed at it - and that traffic is almost entirely coding agents reading API docs in real time. For a marketing site or blog: ~2% adoption, Google ignores it, and inference crawlers barely fetch it on their own.

llms.txt is a proposed plain-text index for AI crawlers that lists your site name, description, and grouped links. The idea is to give models a clean map so they find what matters, but most sites don't actually need it. A lot of companies publish one: Cloudflare, Anthropic, Vercel, Stripe, Supabase, and thousands of others. Directories like llmstxthub.com and directory.llmstxt.cloud list who's doing it.

Look at who actually adopts it, though. Almost every meaningful adopter is a developer-tooling company with documentation that AI coding agents need to read in real time. Cloudflare went so far as to embed a literal warning at the top of every doc page, and it publishes a per-product llms.txt plus an llms-full.txt for the entire docs surface. The use case is concrete: a developer asks a coding agent to write something against the Workers API and points it at the docs. Only then does the agent fetch llms.txt to find the right page and pull the Markdown version. That flow works because the agent has a continuous, evolving need for an accurate API surface.

That's not how a marketing site or a blog gets read. Visitors arrive, read, leave. AI crawlers - whether they collect training data or answer real-time queries - don't have an ongoing relationship with your content the way a coding agent has with a docs site. They just grab a sample and move on, and the numbers show it:

  • ~2% adoption of llms.txt across analyzed sites
  • 1.1% of llms.txt requests come from OAI-SearchBot in a 30-day audit across 1,000 Adobe domains - the rest is Googlebot
  • Google doesn't use llms.txt and has no plans to
  • 94% of crawler hits in the most-cited counter-study came from OAI-SearchBot; GPTBot showed up on just 2 of 8 sites

The pragmatic split:

  • Documentation portal (public API references, SDK guides, technical tutorials) - generate llms.txt and llms-full.txt straight from your CMS collections. The use case is real, and it only takes half a day to ship.
  • Blog or marketing site - skip the directory maps entirely. Instead, spend that time setting up Markdown content negotiation and cleaning up your HTML hierarchy. Those are the only surfaces AI search bots actually care about.
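For the documentation case, generating the file is mostly string assembly. The sketch below follows the layout of the llms.txt proposal (H1 site name, blockquote summary, H2 sections with link lists); the DocLink shape and buildLlmsTxt name are assumptions, and fetching the docs from your CMS collections is left out.

```typescript
// Hypothetical doc entry pulled from a CMS collection.
type DocLink = { title: string; url: string; section: string; description?: string };

function buildLlmsTxt(
  site: { name: string; description: string },
  docs: DocLink[],
): string {
  // Group links by section, preserving first-seen order.
  const sections = new Map<string, DocLink[]>();
  for (const doc of docs) {
    const group = sections.get(doc.section) ?? [];
    group.push(doc);
    sections.set(doc.section, group);
  }

  const lines: string[] = [`# ${site.name}`, "", `> ${site.description}`];
  sections.forEach((links, section) => {
    lines.push("", `## ${section}`, "");
    for (const link of links) {
      const note = link.description ? `: ${link.description}` : "";
      lines.push(`- [${link.title}](${link.url})${note}`);
    }
  });

  return lines.join("\n") + "\n";
}
```

Serve the result as text/plain from a route at /llms.txt and regenerate it when documents are published.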

What Changed, In Numbers?

AI Overviews now cost the top organic result up to 58% of its clicks, GEO-style (generative engine optimization) content lifts visibility by up to 40%, and Markdown cuts the tokens an AI crawler reads by ~80%.

| Metric | Finding | Source |
|---|---|---|
| CTR drop when AI Overview appears (top organic result) | 34.5% (April 2025) → 58% (February 2026) | Ahrefs, 300k keywords |
| Visibility lift from GEO-style content (citations, stats, authority) | Up to +40% in generative responses | Aggarwal et al., KDD 2024 |
| Token reduction when serving Markdown vs HTML | 16,180 → 3,150 tokens (~80% less) | Cloudflare |
| ChatGPT share of Vercel signups | <1% → 4.8% (Mar 2025) → 10% (Apr 2025) | Rauch on X |

Two patterns are worth pulling out:

  • The optimization vector inverted. The tactics that get you noticed by generative engines - like direct citations, verifiable statistics, and structured authority - are the exact opposite of legacy SEO. In fact, old-school tricks like keyword stuffing actually tanked visibility in recent GEO studies.
  • Citation share is the new rank. Position #3 vs. #5 on a Google SERP increasingly misses the question that matters: when an AI synthesizes an answer about your category, does it mention you, and how often? Old rank trackers don't measure it.

The Payload setup we've covered - the robots config, Markdown endpoints, and a structured @graph - directly feeds what models actually care about: hierarchy, citable facts, and verifiable authorship. None of this is isolated effort. It all combines into one strong surface. This places your content exactly where LLMs search for answers.

Key Takeaways

On access. AI search bots and training scrapers are different user agents with different jobs; allowing one doesn't mean allowing the other. The split is explicit and configurable.

On content format. Markdown cuts token overhead by ~80% compared to HTML. Serving it via content negotiation costs half a day and makes your content structurally preferred over sites that don't.

On structured data. Three Schema types – BreadcrumbList, FAQPage, and Article linked to Author – cover hierarchy, citation units, and authority. Without structured authorship, the model has no signal to trust your content over anyone else's.

On llms.txt. Only worth implementing if you run a documentation portal. For blogs and marketing sites, adoption is ~2%, Google ignores it entirely, and the same effort spent on Markdown endpoints returns more.

On the competitive shift. The tactics that improve GEO visibility (direct citations, verifiable statistics, structured authority) are the opposite of legacy SEO keyword optimization. Optimizing for the old model actively hurts visibility in the new one.

On what's actually being measured now. Traditional rank position is the wrong metric. The question that matters is whether an AI synthesizing an answer in your category cites you, and how often.

AI crawlers are no longer a footnote in your traffic mix. They actively consume your content and provide answers to questions that buyers are already asking. Most sites haven't adapted yet, which means the structural work covered here - access configuration, Markdown endpoints, linked Schema - still creates a real gap. This isn't a new discipline called "AI SEO." It's the same job content has always had: be readable, be credible, be findable. The audience and the approach to optimization just expanded.

AI-readiness is an architecture decision

The robots config, Markdown endpoints, and structured @graph covered here aren't one-off optimizations — they're decisions baked into how your Payload project is shaped. Getting them right from the start is easier than retrofitting them later, especially as AI crawlers get more selective about what they surface.

If you're building on Payload CMS and want this layer handled properly, FocusReactive specializes in this stack. Reach out to hash out the details of your setup.

FAQs

Block training scrapers, not search bots. GPTBot, ClaudeBot, and Google-Extended collect data for model training, so it's fine to disallow them. But OAI-SearchBot, Claude-SearchBot, ChatGPT-User, Claude-User, and PerplexityBot are what put you inside real-time AI answers. Block those and you go invisible in AI search.

Markdown cuts the tokens an AI crawler reads by roughly 80% and strips out layout noise, so models parse your content faster and more accurately. You don't need separate URLs: content negotiation returns Markdown when a bot sends Accept: text/markdown and HTML to everyone else, from the same URL.

Three JSON-LD types cover what AI bots actually use: BreadcrumbList for hierarchy, FAQPage for citable answers, and Article linked to Author and Organization for authority. All three map directly onto Payload data, and you can combine them in a single @graph block.

Probably not, unless you publish developer documentation. Coding agents fetch llms.txt when a developer points them at your docs, but general AI search crawlers don't request it on their own. For a blog or marketing site, your effort is better spent on Markdown endpoints and clean HTML hierarchy.

Yes, and it's a different game. Classic SEO ranking doesn't decide whether an AI cites you. Adding FAQPage schema to a page that already ranks raises its odds of appearing in AI Overviews by around 40%, and structured authority signals lift visibility in generative answers by up to 40%.
