The Web You'll Never Read

A website running on Windows Server 2019 was returning the correct pages to every human who visited, to Bing, and to Yandex, and Thai casino spam to Googlebot alone. The admin who posted about it had spent days inside the box looking for the cause. They checked web.config, the URL rewrite rules, the HTTP redirects, the custom error pages, the IIS handlers and modules and ISAPI filters, robots.txt, and the sitemap, and then ran findstr across every drive for the spam domain and the casino keywords. All of it came back clean. The same behaviour showed up on multiple unrelated sites on the same server, which ruled out a single compromised application and pointed at something sitting underneath all of them.

The thing they couldn’t find wasn’t in any site’s files at all. It was a native IIS module, a compiled C++ DLL registered at the server level, and it fetched the spam page from a remote server at the moment Googlebot asked for it. That’s why findstr returned nothing: there was no spam file to find, no malicious string in any application folder, because the payload only existed in memory for the length of a request and only when the request came from the right reader. ESET documented this exact pattern in September 2025 as part of a campaign they named GhostRedirector, which had compromised at least 65 Windows servers, heavily in Brazil, Thailand, and Vietnam. The module they found, Gamshen, checked whether a request carried Googlebot’s user agent or a google.com referer, and only then swapped the response for content pulled from its own command-and-control server. A regular visitor never triggered it, which was the whole design.

Network cables connected to servers inside a dark rack — The poisoned response came from beneath every individual site, where a file search would never find it.

What kept this alive for months is that every method the admin reached for was a reasonable way to check a website, and every one of them was independent of the thing actually being attacked. Bing and Yandex returned the real site because the module ignored them. A human loading the page in a browser saw the real site because the module ignored humans too. The only reader that ever saw the poisoned version was Google’s crawler, and the site’s owner is never that reader. You cannot open your own site as Googlebot and notice something is wrong, because the moment you look, you are looking as yourself.
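
That blind spot can be narrowed a little from the outside, at least for a trigger keyed on request headers like the one ESET described. The sketch below requests the same URL as a browser, with a Googlebot user agent, and with a google.com referer, then compares hashes of the three responses. The function names, the user-agent strings, and the hash comparison are my own assumptions for illustration, not anything taken from the ESET report. A cloaker that verified real crawler IP addresses would not be fooled by this, and a dynamic page that changes on every request will produce false alarms.

```python
import hashlib
import urllib.request
from typing import Callable, Dict

# Illustrative header values; any realistic browser string would do.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Fetcher = Callable[[str, Dict[str, str]], bytes]


def http_fetch(url: str, headers: Dict[str, str]) -> bytes:
    """Fetch a URL with the given request headers and return the raw body."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=15) as response:
        return response.read()


def probe_for_cloaking(url: str, fetch: Fetcher = http_fetch) -> Dict[str, bool]:
    """Return, per crawler-like view, whether its body differs from the browser view."""
    views = {
        "browser": {"User-Agent": BROWSER_UA},
        "googlebot_ua": {"User-Agent": GOOGLEBOT_UA},
        "google_referer": {"User-Agent": BROWSER_UA, "Referer": "https://www.google.com/"},
    }
    digests = {
        name: hashlib.sha256(fetch(url, headers)).hexdigest()
        for name, headers in views.items()
    }
    baseline = digests["browser"]
    return {name: d != baseline for name, d in digests.items() if name != "browser"}
```

Calling `probe_for_cloaking("https://your-site.example/")` returns a small dictionary of flags. A `True` is a reason to look closer, not proof of compromise, and a row of `False` values only says that this particular disguise was not convincing enough to trigger anything.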

A person reading documentation and code side by side on a computer monitor — A human can inspect the page they receive. The crawler-only version remains outside that view.

Strip out the malware and the structure underneath is older than the attack. Serving one version of a page to a search crawler and a different version to everyone else is cloaking, and cloaking has been a black-hat SEO technique for as long as search rankings have been worth money. The split it exploits is the gap between what a crawler reads and what a person reads, and that gap has always been a soft spot precisely because the people who would notice the difference are the ones being shown the clean version. The crawler reads something they never will.

That same split is now being built deliberately, as infrastructure, and sold as a feature. llms.txt is the clearest example: a markdown file you put at the root of your site that lists your important pages in a clean, summarised form, so a language model reading your site gets a curated version instead of parsing your HTML. Microsoft’s NLWeb goes further and turns your content into a queryable endpoint, so an agent sends a natural-language query to a standard ask method and gets structured data back rather than reading the page a person would see. When the Linux Foundation set up the Agentic AI Foundation in December 2025, the anchor members were AWS, Anthropic, Cloudflare, Google, Microsoft, and OpenAI. The companies that compete hardest on AI products agreed to treat the agentic web as shared infrastructure, the same kind of collaboration between competitors that standardised the early web. We are going to maintain a second representation of our sites, written for machine readers, separate from the one humans get.

A close-up of source code displayed on a computer screen — The second representation is becoming infrastructure: maintained for machines, and rarely read by people.

The honest part is that the machine readers mostly aren’t reading it yet. One study of llms.txt adoption put it at around 10% of sites, and the same study found that of the 50 most AI-cited domains only one had the file at all. Analysis of large volumes of AI crawler traffic shows the major bots overwhelmingly skip it and crawl the HTML directly, which is why Google’s John Mueller has compared it to the old keywords meta tag. So this isn’t a live emergency. It’s a direction of travel, and the direction is what matters, because the whole thing is being framed as a discovery problem. The questions in every guide are about visibility: how do I get my llms.txt cited, how do I rank in AI answers, how do I make sure the model describes my brand correctly. Cloaking already showed that the crawler-only surface isn’t a visibility problem but a trust problem, and the cloaking risk was raised the moment llms.txt was proposed, by SEO practitioners who pointed out that you could serve one version to the bots and another to everyone else. llms.txt has no signature and no verification, the owner has no view of what an agent actually received versus what a human got, and the tools that generate it are built to keep it updated without you looking at the output. The integrity layer, signed feeds and content-provenance headers, exists but trails well behind the content layer that everyone is rushing to ship.
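
Short of a real signature, an owner can at least notice when the file has changed since they last read it. The sketch below is one way to do that, assuming you keep a small JSON record of the SHA-256 digest of the version you actually reviewed. The function names and the record format are hypothetical. This is not verification in the signed-feed sense: it says nothing about what an agent received over the network, only whether the generated file still matches the one a human looked at.

```python
import hashlib
import json
from pathlib import Path


def digest(data: bytes) -> str:
    """SHA-256 hex digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def check_against_reviewed(current: bytes, record_path: Path) -> str:
    """Return 'unreviewed', 'unchanged', or 'changed' relative to the last review."""
    if not record_path.exists():
        return "unreviewed"
    record = json.loads(record_path.read_text(encoding="utf-8"))
    return "unchanged" if record["sha256"] == digest(current) else "changed"


def mark_reviewed(current: bytes, record_path: Path) -> None:
    """Record the digest of the version a human has just read."""
    record_path.write_text(
        json.dumps({"sha256": digest(current)}), encoding="utf-8"
    )
```

Run as a step in the publishing pipeline, a "changed" result becomes the prompt to open the file, and `mark_reviewed` is only called after someone has.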

An empty desk and chair facing an open laptop beside a window — The surface can keep changing even when nobody sits down to read what it says.

I have an llms.txt on my own site and I didn’t write it by hand. The AI agent I use to help build my posts generates it as part of creating each one, and once it’s published I move on to the next post and never open the file again. I write about designing for AI readers, I maintain a surface specifically for them, and I could not tell you right now, without going to look, whether what an agent fetches from it matches what you would read on the rendered page. That isn’t a vulnerability today. My site is static, there’s no SQL injection waiting to be found, and none of GhostRedirector’s machinery applies to it. But the shape is the same as the thing that hid casino spam on 65 servers for months: a surface that regenerates on a trigger I don’t watch, for a reader I’ll never be, where nothing that goes wrong would show up to a single human visitor. I should probably go and read it.