AI Hallucinated Links: Building a Semantic 404 Resolver for AI Agents
AI agents do not just crawl the links you publish. They also predict the links they expect to exist.
That is how you end up with requests like /docs/self-hosting, /docs/deploy/single-cluster, or /docs/quickstart even when your actual documentation lives at different canonical URLs. The intent is usually correct. The path is not. If your site responds with a bare 404, that visit ends there.
This problem is now large enough to matter. In a September 2, 2025 Ahrefs study, AI assistants sent visitors to 404 pages 2.87x more often than Google Search. Ahrefs analyzed 16 million unique cited URLs and found that ChatGPT had the highest observed 404 rate among the tested assistants. Earlier, in June 2024, Nieman Lab documented ChatGPT generating plausible but incorrect article URLs for major publishers.
If you publish docs, blogs, or product pages for AI users, this is no longer a theoretical edge case. It is a new class of broken referral traffic.
What Are AI Hallucinated Links?
AI hallucinated links are URLs that look structurally correct but do not actually exist on the destination site.
They are often close to the real URL:
- /docs/self-hosting instead of /docs/self-hosted
- /docs/deploy/single-cluster instead of /docs/self-hosted/install
- /blog/books3-dataset-ai-copyright-infringement instead of the publisher's real slug
The model is not making up a random string. It is usually predicting the most likely path from the site's wording, slug patterns, navigation labels, and prior web examples.
That is why this issue is different from ordinary broken backlinks:
- The request often expresses real intent.
- The incorrect URL may never have existed before.
- The space of possible wrong URLs is effectively unbounded.
Why Static Redirect Tables Do Not Scale
The first instinct is to add redirects for common mistakes. That works for a few obvious cases, but it breaks down quickly.
Why:
- You cannot enumerate all plausible variants of every docs path.
- New pages create new wrong guesses automatically.
- Different agents hallucinate different URL patterns.
- The same page can have multiple near-miss forms: singular vs plural, guide vs guides, deploy vs installation, self-hosting vs self-hosted.
A redirect table is still useful for hard legacy migrations. It is the wrong primary mechanism for hallucinated URLs. The core problem is intent resolution, not URL alias maintenance.
What Is a Semantic 404 Resolver?
A semantic 404 resolver is a 404 handler that tries to infer the most likely canonical page from the requested path.
The important word is semantic. The resolver should not only compare literal strings. It should infer that:
- self-hosting is close to self-hosted
- deploy single cluster is close to install self-hosted
- quickstart is often equivalent to get started
The goal is not to redirect every unknown path. The goal is to recover the obvious ones with high confidence and keep the rest as true 404s with better suggestions.
That distinction matters for SEO. Google has warned for years against soft 404 behavior. If a page does not exist, it should not quietly return 200 OK with unrelated content. Google explicitly recommends returning a real 404 or 410, only redirecting when the URL should map to a more accurate destination, and making 404 pages helpful for users with search and navigation hints. (References: Google on soft 404s; Google on useful 404 pages.)
Our Approach: Local Similarity, Not Alias Exhaustion
We are building this as a local similarity system instead of a giant redirect list.
That means:
- no external embedding API call on every 404
- no online vector database lookup just to resolve one docs request
- no dependency on a third-party ranking service in the request path
For a docs site, that tradeoff is good. The page universe is small enough to index at build time, and the 404 path should stay cheap, deterministic, and low-latency.
How a Local Semantic 404 Resolver Works
The architecture is straightforward.
1. Build a Canonical Page Index
At build time, generate an index of the pages you are willing to recommend or redirect to.
For each page, store:
- canonical URL
- title
- short summary
- section or breadcrumb
- normalized slug tokens
- optional synonym tokens
For a docs site, this data usually already exists in the docs manifest, frontmatter, or sidebar definition.
Example record:
| Field | Example |
|---|---|
| URL | /docs/latest/self-hosted/install |
| Title | Install |
| Section | Self-Hosted |
| Tokens | self, hosted, install, deployment, single, cluster |
| Summary | Install Sandbox0 on Kubernetes and apply a Sandbox0Infra sample. |
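A record like the one above could be produced at build time from whatever the docs manifest already holds. The following is a minimal sketch, not our actual build code: the `ManifestEntry` and `PageRecord` shapes, the `toPageRecord` helper, and the choice to keep every URL segment as a token are all assumptions made for illustration.

```typescript
// Hypothetical input: fields a docs manifest or frontmatter usually provides.
interface ManifestEntry {
  url: string;
  title: string;
  section: string;
  summary: string;
  synonyms?: string[];
}

// Hypothetical shape of one record in the build-time page index.
interface PageRecord {
  url: string;      // canonical URL
  title: string;
  section: string;  // section or breadcrumb
  summary: string;
  tokens: string[]; // normalized slug tokens plus optional synonym tokens
}

// Split text into lowercase alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Build one index record; tokens are deduplicated in first-seen order.
function toPageRecord(entry: ManifestEntry): PageRecord {
  const raw = [
    ...tokenize(entry.url),
    ...tokenize(entry.section),
    ...tokenize(entry.title),
    ...tokenize((entry.synonyms || []).join(" ")),
  ];
  return {
    url: entry.url,
    title: entry.title,
    section: entry.section,
    summary: entry.summary,
    tokens: Array.from(new Set(raw)),
  };
}

const record = toPageRecord({
  url: "/docs/latest/self-hosted/install",
  title: "Install",
  section: "Self-Hosted",
  summary: "Install Sandbox0 on Kubernetes and apply a Sandbox0Infra sample.",
  synonyms: ["deployment", "single-cluster"],
});
```

This sketch keeps generic segments such as `docs` and `latest` as tokens. A real index would probably drop them, since they appear on every page and add no ranking signal.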
2. Normalize the Incoming 404 Path
Turn the requested path into a comparable representation.
Example:
```text
/docs/deploy/single-cluster
```

becomes:

```text
["docs", "deploy", "single", "cluster"]
```
Typical normalization steps:
- lowercase
- URL decode
- strip .html
- split on /, -, _
- remove stopwords like the, a, page
- singularize or stem common variants where safe
- map known synonyms like quickstart -> get-started
This is where most of the practical recall comes from. Many hallucinated URLs are not semantically mysterious. They are simple near-misses.
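The normalization steps above can be sketched in a few lines. This is an illustrative version under stated assumptions: the `STOPWORDS` and `SYNONYMS` tables are placeholders, `singularize` is a deliberately crude rule, and the function name `normalizePath` is hypothetical. The sketch decodes before lowercasing so that percent-encoded letters are lowercased too.

```typescript
// Placeholder tables; real ones would be tuned per site.
const STOPWORDS = new Set(["the", "a", "page"]);
const SYNONYMS = new Map<string, string[]>([["quickstart", ["get", "started"]]]);

// Crude singularizer: strips one trailing "s" from words longer than four
// characters, leaving words that end in "ss" untouched.
function singularize(token: string): string {
  const strip = token.length > 4 && /s$/.test(token) && !/ss$/.test(token);
  return strip ? token.slice(0, -1) : token;
}

function normalizePath(path: string): string[] {
  let decoded = path;
  try {
    decoded = decodeURIComponent(path);
  } catch (_err) {
    // Malformed percent-encoding: fall back to the raw path.
  }
  const tokens = decoded
    .toLowerCase()
    .replace(/\.html?$/, "")                          // strip .html / .htm
    .split(/[\/\-_]+/)                                // split on /, -, _
    .filter((t) => t.length > 0 && !STOPWORDS.has(t)) // drop empties and stopwords
    .map(singularize);

  // Expand tokens that have a known synonym mapping.
  const out: string[] = [];
  for (const t of tokens) {
    const mapped = SYNONYMS.get(t);
    if (mapped) out.push(...mapped);
    else out.push(t);
  }
  return out;
}

// normalizePath("/docs/deploy/single-cluster")
//   -> ["docs", "deploy", "single", "cluster"]
// normalizePath("/Docs/Guides/Quickstart.html")
//   -> ["docs", "guide", "get", "started"]
```

Using a `Map` for synonyms rather than a plain object avoids accidental hits on inherited keys such as `constructor` when a path segment happens to match one.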
3. Score Candidate Pages Locally
Once both sides are normalized, rank the canonical pages.
A strong local scoring pipeline usually combines several signals:
| Signal | Why it helps |
|---|---|
| Token overlap | Catches obvious intent matches |
| Edit distance | Handles self-hosting vs self-hosted |
| Trigram similarity | Handles slug phrasing differences |
| Section boost | Helps docs paths prefer docs pages in the same topic area |
| Title match boost | Rewards exact or near-exact concept matches |
| Synonym boost | Bridges deploy and install, quickstart and get started |
You do not need a neural model to get good results on this kind of problem. For many technical documentation sites, a weighted combination of lexical and synonym-aware similarity is enough.
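As a rough illustration of how several of these signals might be combined, the sketch below scores one candidate using token overlap (Jaccard), normalized edit distance (Levenshtein), character-trigram similarity (Dice), and a section boost. The weights, the `Candidate` shape, and the function names are assumptions, and the title and synonym boosts from the table are left out to keep it short.

```typescript
// Hypothetical candidate shape; tokens come from the build-time index.
interface Candidate {
  url: string;
  section: string; // e.g. "docs" or "blog"; used for the section boost
  tokens: string[];
}

// Jaccard index: shared tokens divided by all distinct tokens.
function tokenOverlap(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach((t) => { if (setB.has(t)) shared++; });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Levenshtein distance using a rolling row.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Edit distance rescaled to 0..1, where 1 means identical strings.
function editSimilarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}

// Dice coefficient over distinct character trigrams.
function trigramSimilarity(a: string, b: string): number {
  const grams = (s: string): Set<string> => {
    const out = new Set<string>();
    for (let i = 0; i + 3 <= s.length; i++) out.add(s.slice(i, i + 3));
    return out;
  };
  const ga = grams(a);
  const gb = grams(b);
  if (ga.size === 0 || gb.size === 0) return 0;
  let shared = 0;
  ga.forEach((g) => { if (gb.has(g)) shared++; });
  return (2 * shared) / (ga.size + gb.size);
}

// Placeholder weights; they sum to 1 so the score stays within 0..1.
const WEIGHTS = { overlap: 0.4, edit: 0.25, trigram: 0.25, section: 0.1 };

function scoreCandidate(requestTokens: string[], page: Candidate): number {
  const reqSlug = requestTokens.join("-");
  const pageSlug = page.tokens.join("-");
  const sectionMatch = requestTokens[0] === page.section ? 1 : 0;
  return (
    WEIGHTS.overlap * tokenOverlap(requestTokens, page.tokens) +
    WEIGHTS.edit * editSimilarity(reqSlug, pageSlug) +
    WEIGHTS.trigram * trigramSimilarity(reqSlug, pageSlug) +
    WEIGHTS.section * sectionMatch
  );
}
```

Because each signal is computed separately, the individual terms can be logged next to the final score, which makes a wrong redirect much easier to explain.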
4. Redirect Only on High Confidence
This is the part many implementations get wrong.
If the top candidate is only weakly better than the next one, do not auto-redirect. Return a real 404 and show suggested pages instead.
Good automatic redirect conditions usually look like:
- top score exceeds a fixed threshold
- top result beats the second result by a clear margin
- candidate page type matches the request area
For example:
- /docs/... should usually resolve to docs first
- /blog/... should usually resolve to blog first
- a random unknown marketing slug should not redirect to the docs homepage
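The redirect conditions above translate into a small decision function. This is a sketch under assumptions: the `MIN_SCORE` and `MIN_MARGIN` values are arbitrary starting points rather than tuned numbers, and the `Scored` and `Resolution` types are hypothetical.

```typescript
// Hypothetical scored candidate; "area" is the top-level site area of the page.
interface Scored {
  url: string;
  area: string; // e.g. "docs" or "blog"
  score: number;
}

type Resolution =
  | { kind: "redirect"; url: string }
  | { kind: "not_found"; suggestions: string[] };

// Illustrative thresholds only; tune them against logged 404 paths.
const MIN_SCORE = 0.75;
const MIN_MARGIN = 0.15;

function decide(requestArea: string, candidates: Scored[]): Resolution {
  // Sort a copy so the caller's array is left untouched.
  const ranked = candidates.slice().sort((a, b) => b.score - a.score);
  const suggestions = ranked.slice(0, 3).map((c) => c.url);
  if (ranked.length === 0) return { kind: "not_found", suggestions };

  const top = ranked[0];
  const confident = top.score >= MIN_SCORE;
  const clearWinner = ranked.length < 2 || top.score - ranked[1].score >= MIN_MARGIN;
  const sameArea = top.area === requestArea;

  return confident && clearWinner && sameArea
    ? { kind: "redirect", url: top.url }
    : { kind: "not_found", suggestions };
}
```

All three conditions must hold before a redirect is issued. Anything else falls through to a real 404 that still carries the top-ranked pages as suggestions.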
5. Keep a Helpful 404 for Everything Else
A semantic resolver should improve your 404 page, not replace it.
If there is no confident match, keep the response as a true 404 Not Found and show:
- the top suggested pages
- a site search box
- a link to docs home
- a link to the parent path, if useful
- a link to the sitemap or top docs sections
That follows Google's long-standing guidance for helpful 404s while preserving correct crawl semantics.
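A helpful 404 of this kind might be assembled as follows. This is a minimal sketch using the standard Fetch API `Response`; the `Suggestion` shape, the `notFoundResponse` name, the markup, and the `/docs` home link are assumptions for illustration. The search box and sitemap link are omitted for brevity.

```typescript
// Hypothetical suggestion shape, taken from the ranked candidates.
interface Suggestion {
  url: string;
  title: string;
}

// Escape text before placing it into HTML content or attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a real 404 response that still lists the most likely destinations.
function notFoundResponse(requestedPath: string, suggestions: Suggestion[]): Response {
  const items = suggestions
    .map((s) => `<li><a href="${escapeHtml(s.url)}">${escapeHtml(s.title)}</a></li>`)
    .join("");
  const html =
    "<!doctype html><title>Page not found</title>" +
    `<h1>Page not found</h1><p>No page exists at ${escapeHtml(requestedPath)}.</p>` +
    (items ? `<p>Did you mean:</p><ul>${items}</ul>` : "") +
    '<p><a href="/docs">Docs home</a></p>'; // assumed docs home path
  return new Response(html, {
    status: 404, // keep the true status; never 200 for a missing page
    headers: { "content-type": "text/html; charset=utf-8" },
  });
}
```

The requested path is escaped before it is echoed back, since it is attacker-controlled input.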
Why We Prefer Local Similarity Over Remote Embeddings
You can build a semantic 404 resolver with embeddings and vector search. For larger sites, that may be the right answer.
We are starting with local similarity for simpler reasons:
- the candidate set is small
- build-time indexing is easy
- runtime latency is predictable
- debugging scores is much easier
- self-hosted and edge-friendly deployments stay simple
For technical docs, interpretability matters. If /docs/deploy/single-cluster resolves to /docs/latest/self-hosted/install, we want to know exactly why: token overlap, synonym map, and section weighting. A black-box semantic score is harder to trust when the redirect is wrong.
How This Fits on a Static Docs Site
Our website is a static export, which means the request-time resolver should live at the edge, not inside the docs build itself.
Cloudflare Pages Functions are a good fit for this. The platform lets a Function fall through to the static asset server and also fetch static assets directly. That means the request path can be:
- try the normal static file
- if it exists, return it immediately
- if it is a 404, run local similarity scoring
- if there is a high-confidence match, redirect
- otherwise return a real 404 page with suggestions
Cloudflare also supports route scoping with _routes.json, so the dynamic resolver can be limited to areas like /docs/* instead of every request on the site. (References: Cloudflare Pages routing; Cloudflare Pages API reference.)
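The request flow described above could look roughly like this as a Pages Function. It is a sketch, not our deployed code: `MinimalContext` is a local stand-in for the context type that Cloudflare provides, `resolvePath` is a hypothetical placeholder for the scoring and decision steps, and the choice of a 302 redirect is an assumption. `context.next()` is the Pages Functions call that falls through to the static asset server.

```typescript
// Local stand-in for the Pages Functions context so the sketch stays
// self-contained; real code would use the PagesFunction type from
// @cloudflare/workers-types.
interface MinimalContext {
  request: Request;
  next: () => Promise<Response>;
}

// Hypothetical resolver: returns a canonical path on a confident match,
// otherwise null. A real one would run the local similarity scoring.
const KNOWN_MATCHES = new Map<string, string>([
  ["/docs/self-hosting", "/docs/self-hosted"],
]);

function resolvePath(pathname: string): string | null {
  const match = KNOWN_MATCHES.get(pathname);
  return match === undefined ? null : match;
}

export async function onRequest(context: MinimalContext): Promise<Response> {
  // 1. Let the static asset server answer first.
  const response = await context.next();
  if (response.status !== 404) return response;

  // 2. Only missing pages reach the resolver.
  const url = new URL(context.request.url);
  const target = resolvePath(url.pathname);
  if (target !== null) {
    // Assumption: a temporary redirect, because the match is heuristic.
    return Response.redirect(new URL(target, url).toString(), 302);
  }

  // 3. No confident match: keep the real 404 from the asset server.
  return response;
}
```

Because the static lookup runs first, existing pages never pay for the scoring step. Only genuine misses do.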
Semantic 404 Resolver vs Smart 404 vs Soft 404
These terms are easy to mix up.
| Approach | Behavior | Good idea? |
|---|---|---|
| Soft 404 | Returns 200 OK for a missing page | No |
| Smart 404 page | Returns 404, but shows useful suggestions | Yes |
| Semantic 404 resolver | Redirects when a high-confidence canonical match exists; otherwise returns 404 with suggestions | Yes |
The key is that a semantic 404 resolver should still respect HTTP truth. It is a recovery layer, not an excuse to hide missing pages.
A Practical Decision Rule
If you are implementing this now, start with a conservative policy:
- redirect only when there is one obvious best match
- return 404 with suggestions when confidence is ambiguous
- log every unresolved path for future analysis
This is how the system gets better without turning into an unmaintainable alias list.
The logs tell you:
- which hallucinated concepts appear repeatedly
- which synonym mappings are missing
- whether a new docs section needs clearer naming
- whether a high-value page deserves a shorter canonical slug
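One simple way to surface the repeated concepts is to count unresolved paths by frequency. The sketch below assumes the log has already been reduced to a list of requested paths; the `topUnresolved` name and the sample data are hypothetical.

```typescript
// Count how often each unresolved path appears, most frequent first.
function topUnresolved(paths: string[], limit: number): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const p of paths) {
    counts.set(p, (counts.get(p) || 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}

// Sample data for illustration only.
const sample = ["/docs/quickstart", "/docs/setup", "/docs/quickstart", "/docs/quickstart"];
const top = topUnresolved(sample, 2);
// top[0] is ["/docs/quickstart", 3]; top[1] is ["/docs/setup", 1]
```

A path that keeps reappearing at the top of this list is a candidate for a new synonym mapping or a clearer canonical slug, rather than for a one-off alias.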
Does llms.txt Solve This?
No.
llms.txt, clean sitemaps, strong internal links, and clear docs navigation all help agents discover the right entry points. They reduce upstream ambiguity. They do not solve the downstream problem once an agent has already requested a plausible but incorrect path.
You want both:
- better machine-readable navigation to reduce wrong guesses
- a semantic 404 resolver to recover from the guesses that still happen
FAQ
Are AI hallucinated links a real SEO problem or just noise?
They are real, but they should be framed correctly.
For many sites today, AI referral traffic is still small in absolute terms. But the traffic is often high intent, and the broken-link rate is materially higher than for normal search traffic (2.87x in the Ahrefs study cited above). If the visitor arrives from an agent during evaluation or research, losing that session on a bare 404 is avoidable waste.
Should every unknown URL redirect somewhere?
No.
Only redirect when the destination is clearly the page the visitor meant to reach. Otherwise return a real 404 with suggestions. Redirecting every bad URL to a generic page is exactly the kind of soft 404 behavior Google warns against.
Do you need embeddings to build a semantic 404 resolver?
Not always.
For many docs sites, local tokenization, synonym expansion, edit distance, and weighted lexical similarity are enough. Start simple. Add embeddings only when the page set or ambiguity level justifies it.
What should I index first?
Start with the pages that matter most:
- docs pages
- product pages
- high-traffic blog posts
- comparison or alternative pages
If the candidate set is too broad on day one, quality drops. A smaller, high-quality index is easier to tune.
Final Takeaway
AI hallucinated links are now part of website infrastructure, not just model behavior.
If your site serves AI users, a plain 404 is no longer enough. But a giant redirect table is not enough either.
The better pattern is:
- keep correct HTTP semantics
- rank likely destinations locally
- redirect only on high confidence
- make the remaining 404s genuinely helpful
That is what a semantic 404 resolver should do.
If you are building AI-facing docs or product pages, that work is quickly becoming part of the baseline.
If you want to see the kind of docs structure we optimize for, start with the Sandbox0 docs or browse the repository on GitHub.