
AI Hallucinated Links: Building a Semantic 404 Resolver for AI Agents

Sandbox0 Team

AI agents do not just crawl the links you publish. They also predict the links they expect to exist.

That is how you end up with requests like /docs/self-hosting, /docs/deploy/single-cluster, or /docs/quickstart even when your actual documentation lives at different canonical URLs. The intent is usually correct. The path is not. If your site responds with a bare 404, that visit ends there.

This problem is now large enough to matter. In a September 2, 2025 Ahrefs study, AI assistants sent visitors to 404 pages 2.87x more often than Google Search. Ahrefs analyzed 16 million unique cited URLs and found that ChatGPT had the highest observed 404 rate among the tested assistants. Earlier, in June 2024, Nieman Lab documented ChatGPT generating plausible but incorrect article URLs for major publishers.

If you publish docs, blogs, or product pages for AI users, this is no longer a theoretical edge case. It is a new class of broken referral traffic.

AI hallucinated links are URLs that look structurally correct but do not actually exist on the destination site.

They are often close to the real URL:

  • /docs/self-hosting instead of /docs/self-hosted
  • /docs/deploy/single-cluster instead of /docs/self-hosted/install
  • /blog/books3-dataset-ai-copyright-infringement instead of the publisher's real slug

The model is not making up a random string. It is usually predicting the most likely path from the site's wording, slug patterns, navigation labels, and prior web examples.

That is why this issue is different from ordinary broken backlinks:

  • The request often expresses real intent.
  • The incorrect URL may never have existed before.
  • The space of possible wrong URLs is effectively unbounded.

Why Static Redirect Tables Do Not Scale#

The first instinct is to add redirects for common mistakes. That works for a few obvious cases, but it breaks down quickly.

Why:

  • You cannot enumerate all plausible variants of every docs path.
  • New pages create new wrong guesses automatically.
  • Different agents hallucinate different URL patterns.
  • The same page can have multiple near-miss forms: singular vs plural, guide vs guides, deploy vs installation, self-hosting vs self-hosted.

A redirect table is still useful for hard legacy migrations. It is the wrong primary mechanism for hallucinated URLs. The core problem is intent resolution, not URL alias maintenance.

What Is a Semantic 404 Resolver?#

A semantic 404 resolver is a 404 handler that tries to infer the most likely canonical page from the requested path.

The important word is semantic. The resolver should not only compare literal strings. It should infer that:

  • self-hosting is close to self-hosted
  • deploy single cluster is close to install self-hosted
  • quickstart is often equivalent to get started

The goal is not to redirect every unknown path. The goal is to recover the obvious ones with high confidence and keep the rest as true 404s with better suggestions.

That distinction matters for SEO. Google has warned for years against soft 404 behavior. If a page does not exist, it should not quietly return 200 OK with unrelated content. Google explicitly recommends returning a real 404 or 410, redirecting only when the URL should map to a more accurate destination, and making 404 pages helpful for users with search and navigation hints. (References: Google on soft 404s; Google on useful 404 pages.)

Our Approach: Local Similarity, Not Alias Exhaustion#

We are building this as a local similarity system instead of a giant redirect list.

That means:

  • no external embedding API call on every 404
  • no online vector database lookup just to resolve one docs request
  • no dependency on a third-party ranking service in the request path

For a docs site, that tradeoff is good. The page universe is small enough to index at build time, and the 404 path should stay cheap, deterministic, and low-latency.

How a Local Semantic 404 Resolver Works#

The architecture is straightforward.

1. Build a Canonical Page Index#

At build time, generate an index of the pages you are willing to recommend or redirect to.

For each page, store:

  • canonical URL
  • title
  • short summary
  • section or breadcrumb
  • normalized slug tokens
  • optional synonym tokens

For a docs site, this data usually already exists in the docs manifest, frontmatter, or sidebar definition.

Example record:

| Field | Example |
| --- | --- |
| URL | /docs/latest/self-hosted/install |
| Title | Install |
| Section | Self-Hosted |
| Tokens | self, hosted, install, deployment, single, cluster |
| Summary | Install Sandbox0 on Kubernetes and apply a Sandbox0Infra sample. |
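
As a rough illustration, a record like this could be produced at build time with a few lines of Python. This is a minimal sketch, not our build code: `PageRecord`, `slug_tokens`, and `build_record` are hypothetical names, and dropping `docs` and `latest` as site-wide prefixes is an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PageRecord:
    """One canonical page the resolver may recommend or redirect to."""
    url: str
    title: str
    section: str
    summary: str
    tokens: frozenset = field(default_factory=frozenset)


def slug_tokens(text: str) -> set:
    """Lowercase the text and split on anything that is not a letter or digit."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def build_record(url, title, section, summary, synonyms=()):
    """Derive tokens from the URL, title, and section, then add synonym tokens."""
    tokens = slug_tokens(url) | slug_tokens(title) | slug_tokens(section)
    tokens |= {s.lower() for s in synonyms}
    tokens -= {"docs", "latest"}  # assumed site-wide prefixes that carry no intent
    return PageRecord(url, title, section, summary, frozenset(tokens))


record = build_record(
    url="/docs/latest/self-hosted/install",
    title="Install",
    section="Self-Hosted",
    summary="Install Sandbox0 on Kubernetes and apply a Sandbox0Infra sample.",
    synonyms=("deployment", "single", "cluster"),
)
print(sorted(record.tokens))
```

The synonym tokens are supplied by hand here; in practice they would more likely come from frontmatter or from reviewing 404 logs.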

2. Normalize the Incoming 404 Path#

Turn the requested path into a comparable representation.

Example:

```text
/docs/deploy/single-cluster
```

becomes:

```text
["docs", "deploy", "single", "cluster"]
```

Typical normalization steps:

  • lowercase
  • URL decode
  • strip .html
  • split on /, -, _
  • remove stopwords like the, a, page
  • singularize or stem common variants where safe
  • map known synonyms like quickstart -> get-started

This is where most of the practical recall comes from. Many hallucinated URLs are not semantically mysterious. They are simple near-misses.
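
The normalization steps listed above could look roughly like this in Python. It is a sketch under stated assumptions: `STOPWORDS`, `VARIANTS`, and `SYNONYMS` are small hypothetical tables, and singularization is done with an explicit lookup instead of a general stemmer so that tokens like `docs` are left alone.

```python
import re
from urllib.parse import unquote

STOPWORDS = {"the", "a", "page"}
# Hypothetical tables; real entries would be tuned from logged 404s.
VARIANTS = {"guides": "guide", "hosting": "hosted"}
SYNONYMS = {"quickstart": ["get", "started"]}


def normalize_path(path: str) -> list:
    """Turn a requested path into a list of comparable tokens."""
    path = unquote(path).lower()          # URL decode, then lowercase
    path = re.sub(r"\.html$", "", path)   # strip a trailing .html
    tokens = []
    # Split on /, -, _ and also on whitespace, which can appear after decoding.
    for raw in re.split(r"[/\-_\s]+", path):
        if not raw or raw in STOPWORDS:
            continue
        canonical = VARIANTS.get(raw, raw)                   # safe variant mapping
        tokens.extend(SYNONYMS.get(canonical, [canonical]))  # synonym expansion
    return tokens


print(normalize_path("/docs/deploy/single-cluster"))
# ['docs', 'deploy', 'single', 'cluster']
```

Using explicit tables keeps the step predictable: every rewrite of a token can be traced to one entry.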

3. Score Candidate Pages Locally#

Once both sides are normalized, rank the canonical pages.

A strong local scoring pipeline usually combines several signals:

| Signal | Why it helps |
| --- | --- |
| Token overlap | Catches obvious intent matches |
| Edit distance | Handles self-hosting vs self-hosted |
| Trigram similarity | Handles slug phrasing differences |
| Section boost | Helps docs paths prefer docs pages in the same topic area |
| Title match boost | Rewards exact or near-exact concept matches |
| Synonym boost | Bridges deploy and install, quickstart and get started |

You do not need a neural model to get good results on this kind of problem. For many technical documentation sites, a weighted combination of lexical and synonym-aware similarity is enough.
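
To make the weighted combination concrete, here is a minimal Python sketch that blends four of the signals above. The weights, the page dictionary shape, and the function names are all assumptions for illustration. It uses `difflib.SequenceMatcher` as a stand-in for a true edit distance, and it leaves out the title and synonym boosts for brevity.

```python
from difflib import SequenceMatcher

# Hypothetical weights; they sum to 1.0 so the total stays roughly within [0, 1].
WEIGHTS = {
    "token_overlap": 0.40,
    "edit_similarity": 0.25,
    "trigram": 0.20,
    "section_boost": 0.15,
}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def trigrams(text: str) -> set:
    """Character trigrams of the text, padded so short words still produce some."""
    padded = f"  {text} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def score(request_tokens: list, page: dict) -> dict:
    """Return each signal plus the weighted total, so a match can be explained."""
    req = set(request_tokens)
    req_slug = "-".join(request_tokens)
    page_slug = "-".join(page["slug_tokens"])
    signals = {
        "token_overlap": jaccard(req, set(page["tokens"])),
        "edit_similarity": SequenceMatcher(None, req_slug, page_slug).ratio(),
        "trigram": jaccard(trigrams(req_slug), trigrams(page_slug)),
        "section_boost": 1.0 if req & set(page["section_tokens"]) else 0.0,
    }
    signals["total"] = sum(WEIGHTS[name] * value for name, value in signals.items())
    return signals


def rank(request_tokens: list, pages: list) -> list:
    """Sort candidate pages by descending total score."""
    return sorted(pages, key=lambda p: score(request_tokens, p)["total"], reverse=True)


PAGES = [
    {"url": "/docs/latest/api/authentication",
     "tokens": ["api", "authentication", "token", "key"],
     "slug_tokens": ["api", "authentication"], "section_tokens": ["api"]},
    {"url": "/docs/latest/self-hosted/install",
     "tokens": ["self", "hosted", "install", "deployment", "single", "cluster"],
     "slug_tokens": ["self", "hosted", "install"], "section_tokens": ["self", "hosted"]},
]

best = rank(["deploy", "single", "cluster"], PAGES)[0]
print(best["url"])
```

Returning the per-signal breakdown alongside the total is deliberate: when a redirect turns out to be wrong, the individual numbers show which signal pushed it over the line.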

4. Redirect Only on High Confidence#

This is the part many implementations get wrong.

If the top candidate is only weakly better than the next one, do not auto-redirect. Return a real 404 and show suggested pages instead.

Good automatic redirect conditions usually look like:

  • top score exceeds a fixed threshold
  • top result beats the second result by a clear margin
  • candidate page type matches the request area

For example:

  • /docs/... should usually resolve to docs first
  • /blog/... should usually resolve to blog first
  • a random unknown marketing slug should not redirect to the docs homepage
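
The three conditions above translate into a small decision function. The sketch below is illustrative only: the threshold and margin values are placeholders to be tuned, `decide` and `area` are hypothetical names, and the use of a 302 status is an assumption (a temporary redirect is the cautious choice while the scoring is still being tuned).

```python
from typing import NamedTuple, Optional

REDIRECT_THRESHOLD = 0.75  # hypothetical minimum score for the top candidate
MIN_MARGIN = 0.15          # hypothetical minimum lead over the runner-up


class Decision(NamedTuple):
    status: int
    location: Optional[str]
    suggestions: list


def area(path: str) -> str:
    """First path segment, e.g. 'docs' for /docs/deploy/single-cluster."""
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else ""


def decide(request_path: str, ranked: list) -> Decision:
    """ranked is a list of (url, score) pairs sorted by descending score."""
    suggestions = [url for url, _ in ranked[:3]]
    if not ranked:
        return Decision(404, None, suggestions)
    top_url, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    confident = (
        top_score >= REDIRECT_THRESHOLD
        and top_score - runner_up >= MIN_MARGIN
        and area(top_url) == area(request_path)
    )
    if confident:
        return Decision(302, top_url, suggestions)
    return Decision(404, None, suggestions)


print(decide("/docs/deploy/single-cluster",
             [("/docs/latest/self-hosted/install", 0.9), ("/docs/latest/get-started", 0.5)]))
```

Every branch that is not clearly confident falls through to a 404 with suggestions, so a mistake in tuning degrades to a helpful error page instead of a wrong redirect.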

5. Keep a Helpful 404 for Everything Else#

A semantic resolver should improve your 404 page, not replace it.

If there is no confident match, keep the response as a true 404 Not Found and show:

  • the top suggested pages
  • a site search box
  • a link to docs home
  • a link to the parent path, if useful
  • a link to the sitemap or top docs sections

That follows Google's long-standing guidance for helpful 404s while preserving correct crawl semantics. (Reference: Google's guidance.)
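
One way to assemble those elements is to build a small context object that the 404 template renders. This is a sketch with hypothetical names (`not_found_context`, `known_urls`) and an assumed docs home of `/docs`; it links the parent path only when that parent is itself a known page.

```python
import posixpath


def not_found_context(path: str, ranked_urls: list, known_urls: set,
                      max_suggestions: int = 3) -> dict:
    """Collect what a helpful 404 page needs while keeping the 404 status."""
    parent = posixpath.dirname(path.rstrip("/"))
    return {
        "status": 404,                                        # still a real Not Found
        "requested": path,
        "suggestions": ranked_urls[:max_suggestions],         # top suggested pages
        "parent": parent if parent in known_urls else None,   # only if it exists
        "docs_home": "/docs",                                 # assumed docs entry point
        "show_search": True,
    }


known = {"/docs", "/docs/latest/self-hosted/install"}
context = not_found_context(
    "/docs/quickstrat",
    ["/docs/latest/get-started", "/docs/latest/self-hosted/install"],
    known,
)
print(context["parent"], context["suggestions"])
```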

Why We Prefer Local Similarity Over Remote Embeddings#

You can build a semantic 404 resolver with embeddings and vector search. For larger sites, that may be the right answer.

We are starting with local similarity for simpler reasons:

  • the candidate set is small
  • build-time indexing is easy
  • runtime latency is predictable
  • debugging scores is much easier
  • self-hosted and edge-friendly deployments stay simple

For technical docs, interpretability matters. If /docs/deploy/single-cluster resolves to /docs/latest/self-hosted/install, we want to know exactly why: token overlap, synonym map, and section weighting. A black-box semantic score is harder to trust when the redirect is wrong.

How This Fits on a Static Docs Site#

Our website is a static export, which means the request-time resolver should live at the edge, not inside the docs build itself.

Cloudflare Pages Functions are a good fit for this. The platform lets a Function fall through to the static asset server and also fetch static assets directly. That means the request path can be:

  1. try the normal static file
  2. if it exists, return it immediately
  3. if it is a 404, run local similarity scoring
  4. if there is a high-confidence match, redirect
  5. otherwise return a real 404 page with suggestions

Cloudflare also supports route scoping with _routes.json, so the dynamic resolver can be limited to areas like /docs/* instead of every request on the site. (References: Cloudflare Pages routing; Cloudflare Pages API reference.)
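
The five-step request path can be sketched independently of any platform. The Python below is not Cloudflare's API (a real Pages Function would be written in JavaScript or TypeScript); `fetch_static` and `resolve` are hypothetical callables injected by the caller, the prefix check only imitates the idea of route scoping, and the 302 status is an assumption.

```python
from typing import Callable

SCOPED_PREFIXES = ("/docs/",)  # assumed scope, similar in spirit to limiting routes to /docs/*


def handle_request(path: str, fetch_static: Callable, resolve: Callable) -> dict:
    # Steps 1-2: try the normal static file and return it if it exists.
    status, body = fetch_static(path)
    if status != 404 or not path.startswith(SCOPED_PREFIXES):
        return {"status": status, "headers": {}, "body": body}

    # Step 3: the static lookup failed inside the scoped area, so score locally.
    target, suggestions = resolve(path)

    # Step 4: redirect only when the resolver reports a confident match.
    if target is not None:
        return {"status": 302, "headers": {"Location": target}, "body": ""}

    # Step 5: otherwise keep a real 404 and list the suggestions.
    links = "\n".join(f"- {url}" for url in suggestions)
    return {"status": 404, "headers": {}, "body": f"Not found: {path}\n{links}"}


# Stand-ins for the static asset server and the similarity resolver.
STATIC = {"/docs/latest/self-hosted/install": "<h1>Install</h1>"}


def fetch_static(path: str):
    return (200, STATIC[path]) if path in STATIC else (404, "")


def resolve(path: str):
    if path == "/docs/deploy/single-cluster":
        return "/docs/latest/self-hosted/install", []
    return None, ["/docs/latest/self-hosted/install"]


print(handle_request("/docs/deploy/single-cluster", fetch_static, resolve))
```

Existing pages never reach the resolver in this flow, so the scoring cost is paid only on requests that would have been a 404 anyway.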

Semantic 404 Resolver vs Smart 404 vs Soft 404#

These terms are easy to mix up.

| Approach | Behavior | Good idea? |
| --- | --- | --- |
| Soft 404 | Returns 200 OK for a missing page | No |
| Smart 404 page | Returns 404, but shows useful suggestions | Yes |
| Semantic 404 resolver | Redirects when a high-confidence canonical match exists; otherwise returns 404 with suggestions | Yes |

The key is that a semantic 404 resolver should still respect HTTP truth. It is a recovery layer, not an excuse to hide missing pages.

A Practical Decision Rule#

If you are implementing this now, start with a conservative policy:

  • redirect only when there is one obvious best match
  • return 404 with suggestions when confidence is ambiguous
  • log every unresolved path for future analysis

This is how the system gets better without turning into an unmaintainable alias list.

The logs tell you:

  • which hallucinated concepts appear repeatedly
  • which synonym mappings are missing
  • whether a new docs section needs clearer naming
  • whether a high-value page deserves a shorter canonical slug
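
A simple way to mine those logs is to count which tokens recur across unresolved paths. The sketch below assumes the log is just a list of requested paths; `summarize_unresolved` and the ignored-token set are hypothetical.

```python
import re
from collections import Counter

IGNORED = {"docs"}  # assumed prefix tokens that say nothing about intent


def path_tokens(path: str) -> set:
    """Distinct lowercase tokens in a path, split on /, -, and _."""
    return {t for t in re.split(r"[/\-_]+", path.lower()) if t and t not in IGNORED}


def summarize_unresolved(paths: list, top_n: int = 5) -> list:
    """Count how many unresolved paths mention each token."""
    counts = Counter()
    for path in paths:
        counts.update(path_tokens(path))  # each token counted once per path
    return counts.most_common(top_n)


unresolved = [
    "/docs/deploy/single-cluster",
    "/docs/deploy/multi-cluster",
    "/docs/deploy-guide",
    "/docs/quickstart",
]
print(summarize_unresolved(unresolved, top_n=2))
# [('deploy', 3), ('cluster', 2)]
```

A token such as `deploy` showing up repeatedly is a hint that a synonym mapping is missing or that a section needs clearer naming.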

Does llms.txt Solve This?#

No.

llms.txt, clean sitemaps, strong internal links, and clear docs navigation all help agents discover the right entry points. They reduce upstream ambiguity. They do not solve the downstream problem once an agent has already requested a plausible but incorrect path.

You want both:

  • better machine-readable navigation to reduce wrong guesses
  • a semantic 404 resolver to recover from the guesses that still happen

FAQ#

AI hallucinated links are a real problem, but they should be framed correctly.

For many sites today, AI referral traffic is still small in absolute terms. But the traffic is often high intent, and the broken-link rate is materially higher than normal search traffic. If the visitor arrives from an agent during evaluation or research, losing that session on a bare 404 is avoidable waste.

Should every unknown URL redirect somewhere?#

No.

Only redirect when the destination is clearly the page the visitor meant to reach. Otherwise return a real 404 with suggestions. Redirecting every bad URL to a generic page is exactly the kind of soft 404 behavior Google warns against.

Do you need embeddings to build a semantic 404 resolver?#

Not always.

For many docs sites, local tokenization, synonym expansion, edit distance, and weighted lexical similarity are enough. Start simple. Add embeddings only when the page set or ambiguity level justifies it.

What should I index first?#

Start with the pages that matter most:

  • docs pages
  • product pages
  • high-traffic blog posts
  • comparison or alternative pages

If the candidate set is too broad on day one, quality drops. A smaller, high-quality index is easier to tune.

Final Takeaway#

AI hallucinated links are now part of website infrastructure, not just model behavior.

If your site serves AI users, a plain 404 is no longer enough. But a giant redirect table is not enough either.

The better pattern is:

  1. keep correct HTTP semantics
  2. rank likely destinations locally
  3. redirect only on high confidence
  4. make the remaining 404s genuinely helpful

That is what a semantic 404 resolver should do.

If you are building AI-facing docs or product pages, that work is quickly becoming part of the baseline.

If you want to see the kind of docs structure we optimize for, start with the Sandbox0 docs or browse the repository on GitHub.