Make Your Payload CMS Site AI-Ready with Next.js Optimizations

Unlike Googlebot, AI crawlers flood your site with heavy traffic, look for raw Markdown, and scan for custom metadata. If you're building on Next.js and Payload, implementing these optimizations takes minimal overhead - but leaving them misconfigured makes your content invisible to LLMs.

The shift is already measurable. Ahrefs analyzed 300,000 keywords and found that when AI Overviews appear, the top organic result loses 34.5% of its clicks. Their February 2026 follow-up put that figure at 58%. Vercel went from less than 1% to 10% of new signups arriving from ChatGPT in six months. The traffic mix is moving - quietly, but fast.

TL;DR

Access first. AI crawlers are blocked by Cloudflare's Bot Fight Mode by default. Unblock search bots at the CDN level, then use robots.ts to explicitly allow search indexers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) while blocking training scrapers (GPTBot, ClaudeBot, Google-Extended). Nothing else here matters until requests reach your server.
Serve Markdown via content negotiation. Add a middleware check for Accept: text/markdown and return raw Markdown from a route handler. This cuts token overhead by ~80% versus HTML and makes your content structurally preferred by LLM pipelines — high signal, low effort.
Link your structured data into one @graph. Three Schema types — BreadcrumbList for hierarchy, FAQPage for citable facts, and Article linked to Author and Organization — are what AI search bots actually use to assess relevance and authority. Without structured authorship, models have no signal to trust your content over anyone else's.
Skip llms.txt unless you run a documentation site. Adoption is ~2% outside developer tooling, Google ignores it entirely, and AI search crawlers don't fetch it unprompted. Spend that time on the Markdown endpoint instead.

How Do You Let AI Crawlers Reach Your Site?

Cloudflare's Bot Fight Mode blocks AI bots by default. You have to explicitly unblock them in Cloudflare, then split search bots from training scrapers in your robots.ts.

Before you optimize anything, confirm the crawlers can reach your server.

If you use Cloudflare, Bot Fight Mode is usually on by default and blocks GPTBot and PerplexityBot before requests reach your app. Choose the option for your plan:

Paid plans - flip the AI-bot toggles in Security → Bots.
Free plan - add a WAF Custom Rule with action Skip matching the user agents you want through.

Same goal either way: the request has to reach your origin before anything downstream matters.

On the Next.js side, you handle this in app/robots.ts. A wildcard rule lets everyone in. The smarter move is to welcome AI search while blocking AI training - surface in real-time answers on ChatGPT and Claude, without silently feeding the next round of LLM training.

Each major vendor now ships separate user agents for those two jobs. Split them cleanly.

| User-Agent | Purpose | Action |
| --- | --- | --- |
| `OAI-SearchBot` | Search indexer | Allow |
| `ChatGPT-User` | Live in-chat fetches | Allow |
| `Claude-SearchBot` | Search indexer | Allow |
| `Claude-User` | Live in-chat fetches | Allow |
| `PerplexityBot` | Conversational search crawler | Allow |
| `GPTBot` | Training scraper | Block |
| `ClaudeBot` | Training scraper | Block |
| `Google-Extended` | Gemini training corpus | Block |
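The split in the table can be sketched as a robots config. This is a minimal sketch, not a definitive implementation: the local `RobotsConfig` type mirrors the shape Next.js expects from `app/robots.ts` (`MetadataRoute.Robots`) so the snippet stands alone, and the sitemap URL is a placeholder.

```typescript
// Shape mirrors what Next.js expects from app/robots.ts (MetadataRoute.Robots),
// redeclared locally so the sketch type-checks without the `next` package.
type RobotsRule = { userAgent: string | string[]; allow?: string; disallow?: string };
type RobotsConfig = { rules: RobotsRule[]; sitemap?: string };

// Search indexers and live in-chat fetchers: let them in.
const AI_SEARCH_BOTS = [
  "OAI-SearchBot",
  "ChatGPT-User",
  "Claude-SearchBot",
  "Claude-User",
  "PerplexityBot",
];

// Training scrapers: keep them out.
const AI_TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"];

// In app/robots.ts this function would be the default export.
function robots(): RobotsConfig {
  return {
    rules: [
      { userAgent: "*", allow: "/" },
      { userAgent: AI_SEARCH_BOTS, allow: "/" },
      { userAgent: AI_TRAINING_BOTS, disallow: "/" },
    ],
    // Placeholder domain (assumption); replace with your own.
    sitemap: "https://example.com/sitemap.xml",
  };
}
```

Keeping the two lists as named constants makes the policy easy to audit when a vendor ships a new user agent. Note that robots rules are a request, not an enforcement mechanism; bots that ignore them have to be handled at the CDN or WAF.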

Deploy, then check your access logs three to five days later. If you don't see successful requests from the bots you allowed, such as OAI-SearchBot or PerplexityBot, something upstream - like a WAF, CDN, or origin firewall - is still blocking them. Audit your network configuration to locate the bottleneck.
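The log check can be scripted. The sketch below assumes access logs in the common "combined" format, where the status code follows the quoted request line and the user agent is the last quoted field; adjust the parsing if your host logs differently. The function name and bot list are illustrative.

```typescript
const EXPECTED_BOTS = [
  "OAI-SearchBot",
  "ChatGPT-User",
  "Claude-SearchBot",
  "Claude-User",
  "PerplexityBot",
];

// Counts 2xx responses per bot. Assumes combined log format:
//   ... "GET /path HTTP/1.1" 200 1234 "referer" "user-agent"
function countSuccessfulBotHits(
  logLines: string[],
  bots: string[] = EXPECTED_BOTS,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bot of bots) counts[bot] = 0;

  for (const line of logLines) {
    // The status code follows the closing quote of the request line.
    const code = line.match(/" (\d{3}) /)?.[1] ?? "";
    if (!code.startsWith("2")) continue;

    // Attribute the hit to the first bot name found in the line.
    const bot = bots.find((name) => line.includes(name));
    if (bot) counts[bot] = (counts[bot] ?? 0) + 1;
  }
  return counts;
}
```

A bot that stays at zero after several days is a hint to look upstream at the WAF, CDN, or origin firewall.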

How Do You Expose a Markdown Endpoint for Every Page?

Use content negotiation to serve Markdown. Return raw Markdown when bots hit your URLs with an Accept: text/markdown header.

Sure, AI crawlers can parse HTML. But they process plain Markdown way more efficiently - it means fewer tokens, zero layout noise, and a much cleaner structure. Cloudflare reported roughly 80% fewer tokens on one of their own posts when serving Markdown vs. HTML. That's why LLM pipelines prefer it when it's available, and why a Markdown version of every public page is the highest-signal surface you can offer them.

The implementation has two parts:

Middleware (src/proxy.ts) - checks for the Accept: text/markdown header and rewrites matching requests to an internal /md/* path. This keeps URLs clean and uniform for both users and crawlers, with no need for .md extensions or extra canonicals.
A route handler (app/md/posts/[slug]/route.ts) - fetches the document from Payload and returns it as plain Markdown.
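The negotiation step can be sketched as two pure functions, kept free of Next.js imports so they are easy to test. The helper names are hypothetical, and the wiring into `src/proxy.ts` is shown only as a comment because it depends on your Next.js version.

```typescript
// True when the Accept header lists text/markdown with a non-zero q-value.
function wantsMarkdown(acceptHeader: string | null): boolean {
  if (!acceptHeader) return false;
  return acceptHeader.split(",").some((part) => {
    const [type = "", ...params] = part.split(";").map((s) => s.trim().toLowerCase());
    if (type !== "text/markdown") return false;
    const q = params.find((p) => p.startsWith("q="));
    // No q parameter means full preference; q=0 means "not acceptable".
    return q === undefined || Number(q.slice(2)) > 0;
  });
}

// Maps a public path to the internal Markdown route, or null to fall through to HTML.
function markdownRewriteTarget(acceptHeader: string | null, pathname: string): string | null {
  if (!wantsMarkdown(acceptHeader)) return null;
  if (pathname.startsWith("/md/")) return null; // already internal, avoid rewrite loops
  return `/md${pathname}`;
}

// In src/proxy.ts, the exported handler could use it roughly like this:
//   const target = markdownRewriteTarget(
//     request.headers.get("accept"),
//     request.nextUrl.pathname,
//   );
//   return target
//     ? NextResponse.rewrite(new URL(target, request.url))
//     : NextResponse.next();
```

With this mapping, a request for /posts/hello that accepts text/markdown is served by app/md/posts/[slug]/route.ts while the public URL stays unchanged.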

buildMarkdown assembles the response: an H1 title, the body, and whatever fields you need (publication date, excerpt, author). What goes inside the body depends on how you store content.

If your collection has a contentMarkdown plain-text field, use it directly. If content lives in Lexical JSON - the default in Payload 3.x - use the built-in convertLexicalToMarkdown from @payloadcms/richtext-lexical.
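One way buildMarkdown could look, as a sketch under stated assumptions: the `Post` shape and its field names (other than contentMarkdown) are hypothetical, and the Lexical converter is passed in as a function so the snippet does not depend on Payload. In a real project that function would wrap convertLexicalToMarkdown.

```typescript
// Hypothetical post shape; align the field names with your own collection.
type Post = {
  title: string;
  publishedAt?: string; // ISO date string
  excerpt?: string;
  author?: string;
  contentMarkdown?: string; // plain-text Markdown field, if the collection has one
  content?: unknown; // Lexical JSON otherwise
};

// `lexicalToMarkdown` is injected so the sketch stays independent of Payload.
function buildMarkdown(post: Post, lexicalToMarkdown: (content: unknown) => string): string {
  // Prefer the stored Markdown; fall back to converting the Lexical source.
  const body = post.contentMarkdown ?? lexicalToMarkdown(post.content);

  const meta = [
    post.publishedAt ? `Published: ${post.publishedAt.slice(0, 10)}` : null,
    post.author ? `Author: ${post.author}` : null,
  ].filter((line): line is string => line !== null);

  const sections = [`# ${post.title}`];
  // Two trailing spaces force a Markdown line break between metadata lines.
  if (meta.length > 0) sections.push(meta.join("  \n"));
  if (post.excerpt) sections.push(`> ${post.excerpt}`);
  sections.push(body.trim());

  return sections.join("\n\n") + "\n";
}
```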

For complex content models with custom blocks, don't serialize on every request. Add a beforeChange hook that converts Lexical to Markdown at save time and stores it alongside the source. Serialization is a publish-time operation, not a read-time one.
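A publish-time hook along these lines might look as follows. This is a sketch: the argument type is a minimal local stand-in for what Payload passes to a collection beforeChange hook, the `content` field name is an assumption, and `convert` is a hypothetical wrapper around whichever Lexical-to-Markdown serializer you use.

```typescript
// Minimal local stand-in for the arguments Payload passes to a collection
// beforeChange hook; the real type carries more fields (req, operation, ...).
type BeforeChangeArgs = { data: Record<string, unknown> };

// Builds a hook that stores a Markdown copy next to the Lexical source.
function makeSyncMarkdownHook(convert: (lexical: unknown) => string) {
  return ({ data }: BeforeChangeArgs): Record<string, unknown> => {
    // Skip partial updates that do not touch the rich-text field.
    if (data.content === undefined) return data;
    return { ...data, contentMarkdown: convert(data.content) };
  };
}

// Registration in the collection config would look roughly like:
//   hooks: { beforeChange: [makeSyncMarkdownHook(myConverter)] }
```

The route handler can then read contentMarkdown directly and never touch the serializer on a request.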

The Vary: Accept header on the Markdown response matters. It tells caches and CDNs to store HTML and Markdown variants separately, so a browser never gets the Markdown body and an agent never gets the HTML page.
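A small helper makes the header hard to forget. The sketch uses the standard Web `Response` that Next.js route handlers return; the helper name is hypothetical.

```typescript
// Wraps a Markdown string in a Response that caches can key on the Accept header.
function markdownResponse(markdown: string): Response {
  return new Response(markdown, {
    status: 200,
    headers: {
      "Content-Type": "text/markdown; charset=utf-8",
      // Tell caches to store the HTML and Markdown variants of a URL separately.
      Vary: "Accept",
    },
  });
}

// In app/md/posts/[slug]/route.ts, the GET handler would end with something like:
//   return markdownResponse(markdown);
```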

Which Structured Data Do AI Search Bots Actually Read?

Three JSON-LD types - BreadcrumbList, FAQPage, and Article linked to Author/Organization - cover the structural, citation, and authority signals AI search bots actually use. Generate them under one @graph per page.

AI search bots prefer content they can parse cleanly. Three specific Schema types handle most of this heavy lifting, and all of them map perfectly onto Payload CMS collections.

| Schema | What it signals | Source in Payload |
| --- | --- | --- |
| `BreadcrumbList` | Hierarchy | `plugin-nested-docs` `breadcrumbs` |
| `FAQPage` | Citation units | `faqs` collection or array field |
| `Article` + `Author` + `Organization` | Authority | `posts` collection + globals |

You can drop these into one @graph block or split them across separate JSON-LD scripts - parsers handle both the same way.
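For the single-block option, a minimal sketch of the wrapper is below. The function names are hypothetical; the escaping step is a common precaution when editor-entered text ends up inside a script tag, not something specific to Payload.

```typescript
type JsonLdNode = { "@type": string; "@id"?: string; [key: string]: unknown };

// Wraps individual schema nodes in a single @graph document.
function buildGraph(nodes: JsonLdNode[]): { "@context": string; "@graph": JsonLdNode[] } {
  return { "@context": "https://schema.org", "@graph": nodes };
}

// Serializes for embedding in <script type="application/ld+json">.
// Escaping "<" keeps CMS-entered text from closing the script tag early;
// JSON parsers decode \u003c back to "<".
function serializeJsonLd(doc: object): string {
  return JSON.stringify(doc).replace(/</g, "\\u003c");
}
```

In a server component, the serialized string would typically be injected into the script tag as raw HTML so it renders on the server along with the page.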

Editorial tooling that feeds the same signals

The structured data above is only as good as the content it describes. Two plugins close that loop inside the Payload admin. @payloadcms/plugin-seo exposes auto-generate hooks for title, description, and image — you wire them to your own functions, so editors can regenerate meta fields from the document's actual content rather than filling them in by hand.

For the content itself, payload-ai adds AI-assisted generation directly to Text and RichText fields. The practical value for SEO isn't speed, it's consistency: content drafted with your FAQ schema, author fields, and sameAs links already populated gives the structured data something accurate to point at.

Hierarchy from `plugin-nested-docs`

@payloadcms/plugin-nested-docs adds two fields to every document in the collections you enable: parent is a self-referencing relationship (editors pick the parent doc), and breadcrumbs is an auto-populated array of ancestors, each entry with a label and a URL.

Hierarchy is one of the strongest signals AI crawlers use. A page at /docs/getting-started/installation already tells a model what it's about and where it fits - before reading a single line of body text. BreadcrumbList JSON-LD turns that path into something machines can read, and plugin-nested-docs builds it for you: every document already knows its full chain of parent pages, with labels and URLs, ready to drop into the schema below. No URL logic to write by hand, and nothing breaks when an editor renames a parent page.
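Mapping the breadcrumbs array to a BreadcrumbList node is mostly a one-to-one transform. In this sketch the `Breadcrumb` type covers only the label and URL described above, and `siteUrl` is an assumed parameter used to turn relative URLs into absolute ones.

```typescript
// The label + url pair stored on each `breadcrumbs` entry;
// any other properties on the entry are ignored here.
type Breadcrumb = { label: string; url: string };

function buildBreadcrumbList(breadcrumbs: Breadcrumb[], siteUrl: string) {
  return {
    "@type": "BreadcrumbList",
    itemListElement: breadcrumbs.map((crumb, index) => ({
      "@type": "ListItem",
      position: index + 1, // schema.org positions are 1-based
      name: crumb.label,
      // Resolve relative paths against the site origin.
      item: new URL(crumb.url, siteUrl).toString(),
    })),
  };
}
```

If your generated Payload types mark label or url as optional, filter out incomplete entries before mapping.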

FAQ and Author/Organization

An FAQPage schema packages questions and answers into a single machine-readable unit that AI bots can lift verbatim - attribution included. Think of it as an inverted pyramid: keep the acceptedAnswer to one or two tight sentences, and leave the deep details in the page body for human readers. Long, vague answers don't get cited; concise, specific ones do. In Payload, a dedicated faqs collection or a simple array field gives editors a clean, structured way to manage this.
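A sketch of the mapping from FAQ entries to an FAQPage node, assuming each entry exposes plain-text `question` and `answer` fields (hypothetical names; match them to your collection or array field):

```typescript
type Faq = { question: string; answer: string };

function buildFaqPage(faqs: Faq[]) {
  return {
    "@type": "FAQPage",
    mainEntity: faqs.map((faq) => ({
      "@type": "Question",
      name: faq.question,
      // Keep this text short; the page body carries the detail.
      acceptedAnswer: { "@type": "Answer", text: faq.answer },
    })),
  };
}
```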

Article linked to Author and Organization is how machines read authority. A bio written in the page text does nothing for them. The links have to be set in the markup. The Author is a Person with a sameAs array pointing to verified profiles (LinkedIn, GitHub, ORCID); the publisher is an Organization global edited once with its own sameAs and logo. Without structured authorship, you lose the authority signal - leaving the LLM to just guess whether your content is trustworthy.
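The linking works through @id references: the Article points at the Person and the Organization instead of repeating their data. The sketch below assumes hypothetical input shapes and an `/authors/<slug>` URL scheme; only the schema.org property names are fixed.

```typescript
// Hypothetical input shapes; map them from your posts collection and globals.
type Author = { name: string; slug: string; sameAs: string[] };
type Org = { name: string; logoUrl: string; sameAs: string[] };
type ArticleInput = { title: string; url: string; publishedAt: string; author: Author };

function buildAuthorityNodes(article: ArticleInput, org: Org, siteUrl: string) {
  const orgId = `${siteUrl}/#organization`;
  const authorId = `${siteUrl}/authors/${article.author.slug}#person`;

  return {
    organization: {
      "@type": "Organization",
      "@id": orgId,
      name: org.name,
      logo: org.logoUrl,
      sameAs: org.sameAs,
    },
    person: {
      "@type": "Person",
      "@id": authorId,
      name: article.author.name,
      sameAs: article.author.sameAs,
    },
    article: {
      "@type": "Article",
      "@id": `${article.url}#article`,
      headline: article.title,
      datePublished: article.publishedAt,
      // References by @id; the full nodes live once in the same graph.
      author: { "@id": authorId },
      publisher: { "@id": orgId },
    },
  };
}
```

The three returned nodes would then go into the page's graph side by side, so every Article on the site resolves to the same Organization node.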

Two key rules here. First, render everything server-side: your schema must match the DOM. If you hide your FAQ behind a JS accordion that only mounts on click, your JSON-LD will point to phantom markup. Second, keep Organization as a Payload global - edit it once, inject it everywhere, and prevent data drift when your logo or legal name changes. Validate the final output with Google's Rich Results Test before shipping.

When does llms.txt actually matter?

Mostly worth it for developer documentation. AI agents only fetch llms.txt when explicitly pointed at it - and that traffic is almost entirely coding agents reading API docs in real time. For a marketing site or blog: ~2% adoption, Google ignores it, and inference crawlers barely fetch it on their own.

llms.txt is a proposed plain-text index for AI crawlers that lists your site name, description, and grouped links. The idea is to give models a clean map so they find what matters, but most sites don't actually need it. A lot of companies publish one: Cloudflare, Anthropic, Vercel, Stripe, Supabase, and thousands of others. Directories like llmstxthub.com and directory.llmstxt.cloud list who's doing it.

Look at who actually adopts it, though. Almost every meaningful adopter is a developer-tooling company with documentation that AI coding agents need to read in real time. Cloudflare goes further than most: it publishes per-product llms.txt files plus an llms-full.txt for the entire docs surface, and even embeds a literal warning at the top of every doc page. The use case is concrete: a developer asks a coding agent to write something against the Workers API and points it at the docs. Only then does the agent fetch llms.txt to find the right page and pull the Markdown version. That flow works because the agent has a continuous, evolving need for an accurate API surface.

That's not how a marketing site or a blog gets read. Visitors arrive, read, leave. AI crawlers - whether they collect training data or answer real-time queries - don't have an ongoing relationship with your content the way a coding agent has with a docs site. They just grab a sample and move on, and the numbers show it:

~2% adoption of llms.txt across analyzed sites
1.1% of llms.txt requests come from OAI-SearchBot in a 30-day audit across 1,000 Adobe domains - the rest is Googlebot
Google doesn't use llms.txt and has no plans to
94% of crawler hits in the most-cited counter-study came from OAI-SearchBot; GPTBot showed up on just 2 of 8 sites

The pragmatic split:

Documentation portal (public API references, SDK guides, technical tutorials) - generate llms.txt and llms-full.txt straight from your CMS collections. The use case is real, and it only takes half a day to ship.
Blog or marketing site - skip the directory maps entirely. Instead, spend that time setting up Markdown content negotiation and cleaning up your HTML hierarchy. Those are the only surfaces AI search bots actually care about.
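For the documentation-portal case, generating the file is a small string-building job. The sketch follows the layout of the llms.txt proposal (an H1 name, a blockquote summary, then H2 sections of Markdown links); the input types are hypothetical and would be filled from your CMS collections.

```typescript
type DocLink = { title: string; url: string; description?: string };
type DocSection = { heading: string; links: DocLink[] };

function buildLlmsTxt(siteName: string, summary: string, sections: DocSection[]): string {
  const lines = [`# ${siteName}`, "", `> ${summary}`];

  for (const section of sections) {
    lines.push("", `## ${section.heading}`, "");
    for (const link of section.links) {
      // The description after the colon is optional.
      const suffix = link.description ? `: ${link.description}` : "";
      lines.push(`- [${link.title}](${link.url})${suffix}`);
    }
  }
  return lines.join("\n") + "\n";
}
```

Serving the result from a route handler at /llms.txt keeps it in sync with the CMS without a build step.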

What Changed, In Numbers?

AI Overviews now cost the top organic result up to 58% of its clicks, GEO-style content lifts visibility by up to 40%, and Markdown cuts the tokens an AI crawler reads by ~80%.

| Metric | Finding | Source |
| --- | --- | --- |
| CTR drop when AI Overview appears (top organic result) | 34.5% (April 2025) → 58% (February 2026) | Ahrefs, 300k keywords |
| Visibility lift from GEO-style content (citations, stats, authority) | Up to +40% in generative responses | Aggarwal et al., KDD 2024 |
| Token reduction when serving Markdown vs HTML | 16,180 → 3,150 tokens (~80% less) | Cloudflare |
| ChatGPT share of Vercel signups | <1% → 4.8% (Mar 2025) → 10% (Apr 2025) | Rauch on X |

Two patterns are worth pulling out:

The optimization vector inverted. The tactics that get you noticed by generative engines - like direct citations, verifiable statistics, and structured authority - are the exact opposite of legacy SEO. In fact, old-school tricks like keyword stuffing actually tanked visibility in recent GEO studies.
Citation share is the new rank. Position #3 vs. #5 on a Google SERP increasingly misses the question that matters: when an AI synthesizes an answer about your category, does it mention you, and how often? Old rank trackers don't measure it.

The Payload setup we've covered - the robots config, Markdown endpoints, and a structured @graph - directly feeds what models actually care about: hierarchy, citable facts, and verifiable authorship. None of these is an isolated effort; together they form one strong surface that places your content exactly where LLMs look for answers.

Key Takeaways

On access. AI search bots and training scrapers are different user agents with different jobs; allowing one doesn't mean allowing the other. The split is explicit and configurable.

On content format. Markdown cuts token overhead by ~80% compared to HTML. Serving it via content negotiation costs half a day and makes your content structurally preferred over sites that don't.

On structured data. Three Schema types – BreadcrumbList, FAQPage, and Article linked to Author – cover hierarchy, citation units, and authority. Without structured authorship, the model has no signal to trust your content over anyone else's.

On llms.txt. Only worth implementing if you run a documentation portal. For blogs and marketing sites adoption is ~2%, Google ignores it entirely, and the same effort spent on Markdown endpoints returns more.

On the competitive shift. The tactics that improve GEO visibility - direct citations, verifiable statistics, structured authority - are the opposite of legacy SEO keyword optimization. Optimizing for the old model actively hurts visibility in the new one.

On what's actually being measured now. Traditional rank position is the wrong metric. The question that matters is whether an AI synthesizing an answer in your category cites you, and how often.

AI crawlers are no longer a footnote in your traffic mix. They actively consume your content and provide answers to questions that buyers are already asking. Most sites haven't been adapted yet, which means the structural work covered here — access configuration, Markdown endpoints, linked Schema — still creates a real gap. This isn't a new discipline called "AI SEO." It's the same job content has always had: be readable, be credible, be findable. The audience and the approach to optimization just expanded.

AI-readiness is an architecture decision

The robots config, Markdown endpoints, and structured @graph covered here aren't one-off optimizations — they're decisions baked into how your Payload project is shaped. Getting them right from the start is easier than retrofitting them later, especially as AI crawlers get more selective about what they surface.

If you're building on Payload CMS and want this layer handled properly, FocusReactive specializes in this stack. Reach out to hash out the details of your setup.
