The Test That Destroyed Production Data
A vitest case deliberately wrote broken JSON to check whether the code survives it. That went fine — until one line of existsSync at the top of the resolver flipped the order and the same test wrote into real production data.
There's a vitest case in the Nest codebase that deliberately writes broken JSON. It takes the path where the agent registry lives, writes {not json into it and then checks: does the code survive that? Does it fall back to an empty default instead of crashing? That's a good test. Robustness against corrupt persistence is exactly the kind of thing you want to explicitly secure, because it never occurs in normal operation and then eventually does.
The test did exactly what it was supposed to for a long time. Until it started destroying the real production data.
How it was before
resolveRegistryPath() had a simple order: if there's an environment variable that sets the path, take it. Otherwise fall back to a default. The test set the variable to a path in its own temp directory, wrote garbage there, ran the code over it, and cleaned up at the end. A closed loop. The test only touched what belonged to it.
This is what it looked like at its core:
function resolveRegistryPath() {
if (process.env.REGISTRY_PATH) return process.env.REGISTRY_PATH;
return defaultRegistryPath();
}
Nothing spectacular. Env first, default after. The test safety lay implicitly in this order: because the env override stood at the very top, the test could be sure its set path wins.
Why I changed it
The Nest daemon needed a fixed place for agents.json — /var/openape/nest/agents.json, because the daemon runs as its own service user with its own HOME and can't rely on what happens to be in the environment of the calling process. So a check came in at the top of the resolver: if the canonical file exists, take that.
function resolveRegistryPath() {
if (existsSync("/var/openape/nest/agents.json")) {
return "/var/openape/nest/agents.json";
}
if (process.env.REGISTRY_PATH) return process.env.REGISTRY_PATH;
return defaultRegistryPath();
}
For the daemon that was right. It finds its registry reliably, no matter who starts it. What I didn't think along: this existsSync sits above the env check.
How it looks now
On my dev box the Nest daemon was running. It had long created /var/openape/nest/agents.json. So the file existed.
Then the test ran. It dutifully set REGISTRY_PATH to its temp directory, asked resolveRegistryPath() where the registry sits — and got /var/openape/nest/agents.json back. Not its temp path. The real one. Because the existsSync branch stands before the env branch and on a dev box where the daemon has run at least once, always hits.
So the test wrote its {not json not into its temp directory. It wrote it into the registry the running daemon reads. A test that was supposed to prove the code survives corrupt data made the data corrupt — in production, on a machine where real agents were registered.
What fell away
The assumption that a test only touches what belongs to it. That assumption was never anchored in the code. It hung on the env override happening to be at the top of the resolver. As long as that was so, the assumption was true. The moment a new branch came above it, it no longer was — and nothing screamed, because an order has no signature a type checker could check.
The fix is small: env override all the way to the top, before any filesystem probing. And mandatory in the test setUp — the test checks that the override is set before it writes anything. No override set, no write access.
function resolveRegistryPath() {
if (process.env.REGISTRY_PATH) return process.env.REGISTRY_PATH;
if (existsSync("/var/openape/nest/agents.json")) {
return "/var/openape/nest/agents.json";
}
return defaultRegistryPath();
}
This guard should have come in the same commit as the existsSync branch. Not as an afterthought, after the box lost data once. Whoever pulls a branch into the top of a path resolver changes the precedence for everything that writes through that resolver — and that includes every test that has ever written over it.
The order is the contract
I had treated the resolver as an implementation detail. It determines a path, the behavior is obvious, there's nothing to document. Wrong. The precedence in a path resolver is no detail, it's an API. Every caller — and the loudest caller here was the test that deliberately writes destructively — relies on a specific order, even if nobody writes it down anywhere.
A robustness test is only as safe as the precedence it trusts. The test was correct. The resolver was correct. Both on their own did exactly what they were supposed to. The only broken thing was the order between them — and that belonged to no one, so no one checked it.