Article Information
Category: Development
Published: 12 April, 2026
Author: Chris de Gruijter
Reading Time: 14 min
Making Your Website Crawlable and Citable by LLMs Using llms.txt
Most websites are effectively invisible to AI language models. ChatGPT, Claude, and Perplexity are increasingly the place people go to find answers — but if your site is a JavaScript SPA with no structured content surface, the LLM either can't read it, misquotes it, or skips it entirely. This matters more than it used to. AI citations are a real traffic channel now, and they reward sites that make clean, well-structured content easy for machines to consume. The llms.txt standard — modelled loosely on robots.txt — is gaining traction as the practical way to do this. I've now implemented it across two production client sites and built out a full build-time generator for Next.js. Here's exactly how the whole thing works.
Why LLMs Struggle with Modern Websites
Google's crawler is patient and sophisticated. It renders JavaScript, follows redirects, resolves dynamic imports, and scores pages on Core Web Vitals. LLM crawlers are not like that. They prefer fast, clean HTTP responses with real text content. A Next.js or Nuxt app that client-renders its main content — or that buries page text inside deeply nested component trees — is going to get a shallow or empty parse.
Even with SSR, the problem isn't fully solved. AI crawlers are looking for structured, readable context: what is this site about, what are its main sections, what does each page say. An HTML page with global nav, cookie banner, footer, and then three paragraphs of actual content is noisy. A markdown file with the same three paragraphs is perfect.
The other problem is discoverability. LLMs building their training data or doing live web search need a way to discover what's on your site without crawling every URL. That's exactly what llms.txt is for.
The Three Artifacts You Need
The llms.txt approach, as it's used in practice, involves three deliverables. Each solves a different part of the discoverability problem:
1. /llms.txt — The Site Index for AI Crawlers
Think of this as robots.txt for LLMs — except instead of blocking crawlers, it's actively helping them. The file groups your pages by content type and links to the .md version of each page. A typical section looks like this:
# GevelPRO — Gevelrenovatie specialist
## Services
- [Voegwerk](https://gevelpro.nl/voegwerk.md): Professional pointing and repointing services
- [Gevelreiniging](https://gevelpro.nl/reiniging.md): Facade cleaning and restoration
## Blog
- [5 praktische tips voor gevelonderhoud](https://gevelpro.nl/nieuws/5-praktische-tips-voor-gevelonderhoud.md)
## Locations
- [Amsterdam](https://gevelpro.nl/locaties/amsterdam.md)
- [Rotterdam](https://gevelpro.nl/locaties/rotterdam.md)
The format is plain markdown. Each link points to the .md route of the page, not the HTML version. The idea is that an AI agent that wants to understand your site can read this file in seconds and then follow links to get clean content for any page it cares about.
2. /llms-full.txt — The Bulk Corpus File
Some AI agents — particularly those doing bulk ingestion or building RAG indexes — prefer to download a single document containing your full site content. llms-full.txt is that document. It's intentionally curated, not a dump of everything:
- Homepage content — who you are, what you do
- Core service pages — the full description of each offering
- Selected blog posts — your most useful evergreen content
- Portfolio or case study highlights
I don't include location pages (too repetitive across a city portfolio), thin pages, or anything that's more template than content. The goal is to give an LLM a dense, accurate, high-signal picture of the site in one pass.
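As a sketch of how such a curated corpus can be assembled: the CuratedEntry shape and buildLlmsFull helper below are my own illustrative names, not part of any spec or of the generator described later.

```typescript
// Hypothetical shape for one curated page; only content worth bulk ingestion.
interface CuratedEntry {
  title: string;        // used as a section heading in the corpus
  canonicalUrl: string; // the HTML page this content came from
  markdown: string;     // clean page content, no nav or footer noise
}

// Concatenate curated entries into a single llms-full.txt document.
function buildLlmsFull(siteTitle: string, entries: CuratedEntry[]): string {
  const sections = entries.map(
    (e) => `## ${e.title}\n<!-- canonical: ${e.canonicalUrl} -->\n\n${e.markdown.trim()}`
  );
  return `# ${siteTitle}\n\n${sections.join('\n\n')}\n`;
}

const doc = buildLlmsFull('GevelPRO', [
  {
    title: 'Voegwerk',
    canonicalUrl: 'https://gevelpro.nl/voegwerk',
    markdown: 'Voegwerk is het vullen en afwerken van de naden tussen stenen...',
  },
]);
```

The point is the curation step: you pass in a hand-picked list of pages, not the full route manifest.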
3. Page-Level .md Routes
This is the part that makes the whole system work as a network rather than just a static document. Every canonical page on your site gets a .md equivalent at the same path:
- /voegwerk → /voegwerk.md
- /nieuws/some-post-slug → /nieuws/some-post-slug.md
- /locaties/amsterdam → /locaties/amsterdam.md
Each markdown file starts with metadata: canonical HTML URL, page type, last-modified date. Then it's clean prose content — no nav, no footer, no cookie consent noise. Internal links inside the markdown point to the .md version of linked pages, not the HTML. This means an AI agent can follow a link from one page's markdown into another's, crawling your site's content graph without touching HTML at all.
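A minimal renderer for one of these page files could look like the following sketch. The PageInput shape and the function name are assumptions for illustration, not the actual generator code; the example below it shows the kind of file this produces.

```typescript
// Hypothetical input for one page, mirroring the metadata the header needs.
interface PageInput {
  canonicalUrl: string; // HTML URL of the page
  type: string;         // e.g. 'service', 'blog', 'location'
  lastModified: string; // ISO date, e.g. '2026-03-15'
  title: string;
  body: string;         // clean markdown, internal links already rewritten to .md
}

// Emit the metadata comment header followed by the page content.
function renderPageMarkdown(page: PageInput): string {
  return [
    `<!-- canonical: ${page.canonicalUrl} -->`,
    `<!-- type: ${page.type} -->`,
    `<!-- last-modified: ${page.lastModified} -->`,
    `# ${page.title}`,
    '',
    page.body.trim(),
    '',
  ].join('\n');
}

const md = renderPageMarkdown({
  canonicalUrl: 'https://gevelpro.nl/voegwerk',
  type: 'service',
  lastModified: '2026-03-15',
  title: 'Voegwerk — Professioneel voegen en hervoegen',
  body: 'Voegwerk is het vullen en afwerken van de naden tussen stenen...',
});
```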
<!-- canonical: https://gevelpro.nl/voegwerk -->
<!-- type: service -->
<!-- last-modified: 2026-03-15 -->
# Voegwerk — Professioneel voegen en hervoegen
Voegwerk is het vullen en afwerken van de naden tussen stenen, blokken of tegels in een gevel,
vloer of muur. Het beschermt de constructie tegen vocht, vorst en scheurvorming.
## Onze aanpak
We werken uitsluitend met dampopen voegmortel die ademende gevels ondersteunt...
[Bekijk onze locaties](/locaties.md)
The Next.js Approach: Build-Time Generator
For Next.js sites on Cloudflare Pages (free plan), there's no runtime — no Cloudflare Worker, no API routes that run server-side in production. Everything has to be static. That turns out to be a feature, not a constraint: generating all LLM artifacts at build time and writing them to public/ means they're served as static files from the CDN edge globally, with zero cost and zero latency.
Architecture: Route-First, Manifest-Driven
The key insight that makes the generator maintainable is building around a canonical route manifest. Instead of scanning the file system or parsing built HTML, you maintain one source of truth that every other output consumes:
// lib/llms/route-manifest.ts
import type { RouteEntry } from './types';

export const routeManifest: RouteEntry[] = [
  {
    path: '/',
    type: 'core',
    title: 'GevelPRO — Gevelrenovatie specialist',
    lastModified: '2026-03-01',
    mdOutput: 'public/index.md',
  },
  {
    path: '/voegwerk',
    type: 'service',
    title: 'Voegwerk',
    lastModified: '2026-03-15',
    mdOutput: 'public/voegwerk.md',
  },
  // ... all canonical routes
];
Both the sitemap generator and the LLM generator consume this same manifest. They can never drift apart — if a page exists in the sitemap, it has an LLM artifact, and vice versa. Adding a new page means one entry in the manifest, and both outputs are updated on the next build.
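To make the single-source-of-truth idea concrete, here is a sketch of two consumers mapping over the same manifest. The renderer names and the simplified RouteEntry shape are illustrative, not the production code:

```typescript
// Simplified manifest entry, matching the fields used in the manifest above.
interface RouteEntry {
  path: string;
  type: string;
  title: string;
  lastModified: string;
}

const BASE_URL = 'https://gevelpro.nl';

// Consumer 1: sitemap.xml. Every canonical route becomes a <url> entry.
function renderSitemap(manifest: RouteEntry[]): string {
  const urls = manifest
    .map((r) => `  <url><loc>${BASE_URL}${r.path}</loc><lastmod>${r.lastModified}</lastmod></url>`)
    .join('\n');
  return `<?xml version="1.0" encoding="UTF-8"?>\n<urlset>\n${urls}\n</urlset>`;
}

// Consumer 2: llms.txt. The same routes, grouped by type, linking to .md versions.
function renderLlmsIndex(siteTitle: string, manifest: RouteEntry[]): string {
  const byType = new Map<string, RouteEntry[]>();
  for (const r of manifest) {
    const group = byType.get(r.type) ?? [];
    group.push(r);
    byType.set(r.type, group);
  }
  const sections = Array.from(byType.entries()).map(([type, routes]) => {
    const links = routes
      .map((r) => `- [${r.title}](${BASE_URL}${r.path === '/' ? '/index.md' : r.path + '.md'})`)
      .join('\n');
    return `## ${type}\n${links}`;
  });
  return `# ${siteTitle}\n\n${sections.join('\n\n')}\n`;
}

const sample: RouteEntry[] = [
  { path: '/', type: 'core', title: 'GevelPRO', lastModified: '2026-03-01' },
  { path: '/voegwerk', type: 'service', title: 'Voegwerk', lastModified: '2026-03-15' },
];
const sitemap = renderSitemap(sample);
const llmsIndex = renderLlmsIndex('GevelPRO', sample);
```

Because both renderers take the same array, removing a route from the manifest removes it from both outputs in one step.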
Folder Structure
The generator lives in two places: core logic in lib/llms/, and the CLI runner in scripts/llms/:
lib/llms/
types.ts # RouteEntry, ContentExtractor, GeneratorConfig
config.ts # Site metadata, base URL, exclusions
route-manifest.ts # The canonical list of all public routes
validators.ts # Schema checks for manifest entries
generator.ts # Orchestrator — reads manifest, calls extractors, writes files
renderers/
page-markdown.ts # Renders a single .md file for a page
llms-index.ts # Renders /llms.txt from the full manifest
llms-full.ts # Renders /llms-full.txt from curated entries
extractors/
core.ts # Homepage, about, contact content
services.ts # Service page content from your data files
blog.ts # Blog post content
portfolio.ts # Portfolio/case study content
cities.ts # City/location pages
scripts/llms/
generate.ts # CLI entry point — imports generator, runs it
verify.ts # Post-generation QA: checks file counts, validates URLs
Content Extractors: Pull from Data, Not HTML
Each extractor knows how to pull content for its content type from your existing structured data files — not from scanning built HTML. If your service pages are driven by a data/services.ts file, the services extractor imports that directly and renders clean markdown. This means the generated content is always consistent with what the site actually says, and there's no dependency on a running dev server or built output:
// lib/llms/extractors/services.ts
import { services } from '~/data/services';

export function extractServiceContent(path: string): string {
  const slug = path.replace('/', '');
  const service = services.find((s) => s.slug === slug);
  if (!service) return '';
  const lines: string[] = [
    `# ${service.title}`,
    '',
    service.intro,
    '',
  ];
  for (const section of service.sections) {
    lines.push(`## ${section.heading}`);
    lines.push('');
    lines.push(section.body);
    lines.push('');
  }
  return lines.join('\n');
}
Stale File Cleanup
One problem that bites you quickly: if you remove a route from the manifest, the old .md file still sits in public/ and gets deployed. You need a cleanup mechanism that deletes only files this generator created, without touching other assets in public/.
The solution is a small tracking file: public/.llms-generated.json. On each run, the generator reads the list of previously-generated file paths, deletes those files, then writes fresh ones and updates the tracking file. Unrelated assets like images and fonts are never touched.
// lib/llms/generator.ts (simplified)
import { promises as fs } from 'node:fs';

const TRACKING_FILE = 'public/.llms-generated.json';

async function cleanupPreviousGeneration(): Promise<void> {
  const tracked: string[] = JSON.parse(
    await fs.readFile(TRACKING_FILE, 'utf8').catch(() => '[]')
  );
  for (const filePath of tracked) {
    await fs.unlink(filePath).catch(() => null);
  }
}

async function trackGeneratedFiles(files: string[]): Promise<void> {
  await fs.writeFile(TRACKING_FILE, JSON.stringify(files, null, 2));
}
Build Integration
The generator runs before next build. On Cloudflare Pages, prebuild hooks do not run — Cloudflare calls your build command directly. So the generation step has to be baked into the build script itself, not a pre-hook:
{
  "scripts": {
    "build": "npm run llms:generate && next build",
    "llms:generate": "node --import tsx ./scripts/llms/generate.ts",
    "llms:verify": "node --import tsx ./scripts/llms/verify.ts",
    "test:llms": "node --import tsx --test tests/llms-manifest.test.ts tests/llms-generation.test.ts"
  }
}
The verify script runs post-generation and checks that the number of .md files matches the number of manifest entries, that all internal links resolve within the manifest, and that llms.txt and llms-full.txt were both written. Running it in CI catches issues before they reach production.
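A stripped-down version of such a verify pass might look like this. The stand-in manifest and the verifyGeneration helper are illustrative; the real script would import the actual route manifest:

```typescript
import { promises as fs } from 'node:fs';

// Stand-in manifest entries; the real script would import routeManifest.
const routeManifest = [
  { path: '/', mdOutput: 'public/index.md' },
  { path: '/voegwerk', mdOutput: 'public/voegwerk.md' },
];
const INDEX_ARTIFACTS = ['public/llms.txt', 'public/llms-full.txt'];

async function exists(path: string): Promise<boolean> {
  return fs.access(path).then(() => true, () => false);
}

// Collect problems instead of exiting so the check is easy to test.
async function verifyGeneration(): Promise<string[]> {
  const problems: string[] = [];
  // 1. Every manifest entry must have produced its .md file.
  for (const entry of routeManifest) {
    if (!(await exists(entry.mdOutput))) {
      problems.push(`missing generated file: ${entry.mdOutput}`);
    }
  }
  // 2. Both index artifacts must have been written.
  for (const artifact of INDEX_ARTIFACTS) {
    if (!(await exists(artifact))) {
      problems.push(`missing artifact: ${artifact}`);
    }
  }
  return problems;
}
```

A real verify.ts would also check internal link resolution, print the problems, and set process.exitCode = 1 so a CI build fails when anything is missing.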
What to Include and What to Exclude
The manifest should only contain canonical, indexable pages. A few hard rules I apply:
- Include: all pages you'd include in a sitemap. Service pages, blog posts, location pages, core pages (about, contact, homepage).
- Exclude: noindex pages, legacy redirect routes, admin or API routes, paginated duplicates (page 2, page 3 of a category).
- City + service combos: if your site has /locaties/amsterdam/voegwerk as a real HTML page, include it. If it doesn't exist as a real page, don't invent a .md file for it. LLMs trust your URLs — don't create a discrepancy between what the markdown implies exists and what actually loads in a browser.
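The validators.ts module in the folder layout above is a natural home for these rules. Here is a sketch of what a per-entry check might look like; the function and the exact patterns are my own, encoding the rules listed above:

```typescript
// Minimal slice of a manifest entry for validation purposes.
interface ManifestEntry {
  path: string;
  title: string;
  lastModified: string; // expected as YYYY-MM-DD
}

// Check one entry against the inclusion rules; returns the problems found.
function validateEntry(entry: ManifestEntry): string[] {
  const problems: string[] = [];
  if (!entry.path.startsWith('/')) {
    problems.push(`path must start with "/": ${entry.path}`);
  }
  if (/\/(api|admin)\//.test(entry.path + '/')) {
    problems.push(`non-public route: ${entry.path}`);
  }
  if (/\/page\/\d+/.test(entry.path)) {
    problems.push(`paginated duplicate: ${entry.path}`);
  }
  if (entry.title.trim().length === 0) {
    problems.push(`missing title: ${entry.path}`);
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(entry.lastModified)) {
    problems.push(`bad lastModified date: ${entry.path}`);
  }
  return problems;
}
```

Running a check like this over every entry at the start of the generator turns a bad manifest edit into a build failure rather than a broken llms.txt.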
Verified Numbers from Production
I've shipped this generator on two client sites. Here's what the output looks like at scale:
- GevelPRO (gevelpro.nl): 138 canonical routes in the manifest, 140 generated files (138 .md pages + llms.txt + llms-full.txt, not counting the tracking file)
- Overbeek (voegersbedrijfoverbeek.nl): 92 canonical routes, 94 generated artifacts
Build time overhead is under 4 seconds for either site. The generated files are committed to public/ or generated fresh on each Cloudflare Pages build — either approach works. I prefer generating fresh each build so the last-modified dates stay accurate.
The Nuxt Approach: nuxt-ai-ready Module
For Nuxt projects, there's a much simpler path. The nuxt-ai-ready module (version ^0.10.10) handles the full llms.txt pipeline automatically — including markdown route generation, IndexNow submission, and an MCP server for AI agents with full-text search over your content.
Installation
npx nuxi module add nuxt-ai-ready
Configuration
The module requires a Cloudflare D1 database for persistent page indexing and FTS5 full-text search. Wire it up in your nuxt.config.ts and wrangler.toml:
// nuxt.config.ts
export default defineNuxtConfig({
  modules: ['nuxt-ai-ready'],
  aiReady: {
    database: {
      type: 'd1',
      // binding name must match your wrangler.toml
    },
  },
});

# wrangler.toml
[[d1_databases]]
binding = "AI_READY_DB"
database_name = "myclient-ai-ready-db"
database_id = "your-d1-database-id"
The D1 database is what enables persistent indexing and search. Without it, the module can still generate static files, but the MCP server and FTS5 search won't work. For most client sites on Cloudflare, creating a D1 database is a one-minute operation in the Cloudflare dashboard or with wrangler d1 create from the CLI.
One Gotcha: Prerender Errors for .md Routes
When Nitro prerenders your Nuxt site, it will attempt to crawl the .md routes that the module generates — and may throw prerender errors if those aren't explicitly handled. Add them to your Nitro ignore patterns if you hit this:
// nuxt.config.ts
export default defineNuxtConfig({
  nitro: {
    prerender: {
      ignore: ['/**/*.md'],
    },
  },
});
The module is still relatively new and the prerender integration may improve in future versions. Check the module's GitHub repo for the latest guidance.
Why This Is Worth Doing Now
I want to be honest about where the llms.txt standard is right now: it's gaining traction, it's not universal. Major AI systems don't all formally support it yet — there's no RFC, no W3C endorsement, no guarantee that Perplexity's crawler specifically looks for /llms.txt the way Googlebot looks for /robots.txt. The spec is maintained independently by the people who proposed it, and adoption is organic.
That said, the underlying logic is sound regardless of standard adoption. Clean markdown files served as static assets are trivially cheap to generate and maintain. An LLM that hits your /voegwerk.md URL — whether following an llms.txt link or just discovering the URL through other means — is going to get a dramatically better read of your page than it would from the full HTML. The format is self-reinforcing: even if llms.txt itself is ignored, the .md routes are independently useful.
The AI citation channel is real. I've seen Perplexity cite client sites with specific quotes pulled from service pages. Those quotes are accurate when the markdown is clean and inaccurate when it's being parsed from noisy HTML. The investment in clean content surfaces pays off.
Sites that give LLMs clean, structured, markdown-formatted content get cited more accurately — and eventually, more often.
— The working hypothesis behind this whole system
How to Start
If you're on Nuxt: install nuxt-ai-ready, create a D1 database, wire up the config. You can be done in an hour.
If you're on Next.js with Cloudflare Pages: start with the manifest. Sit down and list every canonical URL on your site with its content type, title, and last-modified date. That manifest is the hard part — once it exists, the generator can be built around it incrementally. Start with just llms.txt and a handful of core .md pages, then expand from there.
Either way: run your verify script, deploy once, and check that https://yourdomain.com/llms.txt and https://yourdomain.com/llms-full.txt are publicly accessible and return sensible content. That's the whole thing. It's a two-hour project for most sites, and the static files cost nothing to serve.
The window where this is differentiating is now, while most sites haven't done it. In two years, having llms.txt will be table stakes — the same way having a sitemap is today. Might as well be early.
Frequently Asked Questions
Is llms.txt an official standard?
Not yet. It's an emerging convention maintained by its original proposers, not an RFC or W3C spec. Major AI systems are starting to recognise it, but adoption varies. The underlying approach — exposing clean markdown content at consistent URLs — is independently useful regardless of whether any specific crawler formally supports the llms.txt index format.
Do I need to expose all my pages as markdown, or just the important ones?
All canonical, indexable pages should have a .md equivalent — meaning any page you'd include in your XML sitemap. Thin pages, noindex pages, paginated duplicates, and admin routes should be excluded. The goal is parity between your sitemap and your LLM content surface.
What is the difference between llms.txt and llms-full.txt?
llms.txt is a compact index — it lists your pages grouped by type and links to their .md routes. It's the entry point for an AI agent that wants to explore your site selectively. llms-full.txt is a curated single-file corpus for agents that want to bulk-ingest your content. You curate llms-full.txt manually — it contains the full text of your most important pages concatenated together, not an exhaustive dump of every URL.
Why do internal links inside .md files point to .md URLs instead of HTML?
Because an AI agent following links from within a markdown file should land in more markdown — not in a full HTML page it has to re-parse. By linking to .md equivalents, you build a content graph that an agent can traverse entirely in clean text. It makes the whole network of pages machine-traversable without any HTML parsing.
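A sketch of such a rewrite for root-relative markdown links follows; the rewriteInternalLinks helper is hypothetical, and the regex handles only the common case (external URLs and already-rewritten links are left alone):

```typescript
// Rewrite root-relative markdown links to their .md equivalents,
// e.g. [Locaties](/locaties) becomes [Locaties](/locaties.md).
// External links and links already ending in .md are left untouched.
function rewriteInternalLinks(markdown: string): string {
  return markdown.replace(
    /\]\((\/[^)\s#?]*)([^)]*)\)/g,
    (match: string, path: string, rest: string) => {
      if (path.endsWith('.md')) return match; // already rewritten
      const mdPath = path === '/' ? '/index.md' : `${path}.md`;
      return `](${mdPath}${rest})`;
    }
  );
}
```

Anchors and query strings survive the rewrite; a production version would also need to skip links to routes that have no .md equivalent.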
Does this approach work on Cloudflare Pages free plan?
Yes, that's exactly the deployment target for the Next.js build-time approach. All LLM artifacts are generated before next build runs and written to public/, so Cloudflare Pages serves them as static files. No Workers, no Cloudflare Pro features required. The Nuxt approach with nuxt-ai-ready does require Cloudflare D1 for the full feature set, but D1 is available on the free plan.
How do I handle city pages that share a service template?
Only generate .md files for pages that actually exist as real HTML routes. If /locaties/amsterdam/voegwerk is a real page with its own URL, include it in the manifest. If your site has a single /locaties/amsterdam page that lists all services, generate one .md file for that URL and embed the service sections within it. Never invent .md URLs that don't correspond to real HTML pages — it creates a trust problem if an LLM tries to follow a link and gets a 404.
Will this help with ChatGPT, Claude, and Perplexity specifically?
It depends on how each system crawls. Perplexity actively browses the web during search. ChatGPT with browsing enabled follows links. Claude with web search does the same. All of these benefit from clean markdown routes even without a formal llms.txt lookup — they'll get better content when they do land on your .md URLs. The llms.txt index is an additional discovery aid for systems that specifically check it.