Validate LLM Output at the Boundary
A Jekyll build started failing today on a YAML parse error in a post’s front matter. The post in question — generated by my ww note tool from clipboard content — had a title: field that wasn’t a title at all. It was a four-paragraph Chinese essay, complete with newlines, full-width colons, and bullet lists, all crammed into a single YAML key.
The pipeline looked like this: clipboard content → LLM call asking for “a very short title in English (maximum six words)” → write the response into title: in front matter. The LLM ignored the instruction. The code trusted the response. YAML choked.
The fix was three lines:
```python
# inside generate_title, after the LLM response comes back
title = re.sub(r"\*", " ", raw).strip()  # drop stray markdown asterisks
if len(title) >= TITLE_MAX_CHARS:
    raise ValueError(
        f"Generated title is {len(title)} chars "
        f"(must be < {TITLE_MAX_CHARS}): {title!r}"
    )
return title
```
The lesson is older than LLMs: don’t trust output from a system you don’t control, especially at a boundary where the data crosses into a format with its own invariants. The LLM is a remote, non-deterministic service. Its output crosses into a structured format (YAML) where a single newline in the wrong place corrupts the whole document. That boundary deserves the same scrutiny you’d give a user-submitted form field — schema check it, length-bound it, fail loud.
What made this bug interesting wasn’t that the LLM misbehaved. LLMs misbehave routinely; that’s expected. What made it interesting is that the failure was silent at the source and loud somewhere else entirely. The note got written to disk without any error. The file looked fine to a human eye. Days later, a Jekyll workflow on a different machine surfaced the corruption as a unit test failure with a stack trace pointing at line 15 of a file that hadn’t been touched in days.
That distance — between the moment the bad data is produced and the moment something notices — is where engineering gets expensive. Validating at the boundary collapses the distance to zero. The exception fires inside generate_title, three frames from the LLM call, with the bad string in the message. You see it the moment it happens, on the machine where you can fix it.
A few corollaries I want to remember:
Prompt instructions are not constraints. “Maximum six words” is a hint to the model. It is not a guarantee. If you need a guarantee, you need code that enforces it after the response comes back. This is true for length, format, language, and content; it’s the same reason you don’t trust <input maxlength="6"> as your only defense against long input on the server.
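Enforcing the “maximum six words” part in code might look like the following minimal sketch — `MAX_WORDS` and `enforce_word_cap` are illustrative names, not from the original tool:

```python
import re

MAX_WORDS = 6  # mirrors the prompt's "maximum six words" -- illustrative constant

def enforce_word_cap(response: str) -> str:
    """Reject an LLM title that exceeds the word cap instead of trusting the prompt."""
    words = re.findall(r"\S+", response.strip())
    if len(words) > MAX_WORDS:
        raise ValueError(f"Title has {len(words)} words (max {MAX_WORDS}): {response!r}")
    # normalize internal whitespace while we're here
    return " ".join(words)
```

The prompt still asks nicely; this is the part that actually guarantees it.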
The “happy path test” lies. When I tested ww note originally, the LLM returned reasonable titles every time. Six-word titles. Real titles. The validation gap was invisible because the LLM was cooperating. The bug only surfaced when the model — for whatever reason on that particular call — decided to ignore the instruction and dump an essay. Validation isn’t for the cases you’ve seen; it’s for the cases you haven’t.
Failing loud beats failing silent, even when “loud” means crashing the script. The previous behavior was: corrupt note silently written, build breaks days later in CI. The new behavior is: ww note exits with a ValueError and a clear message, and I rerun it. The second one is unambiguously better. Silent corruption is the worst failure mode there is, because by the time you notice, the evidence has been backed up, synced across machines, and possibly copied into other documents.
I added a test for the specific Chinese-essay case that triggered this. Not because that exact input will happen again — it almost certainly won’t — but because the test now documents the contract: generate_title is allowed to fail, but it is not allowed to return a 200-character paragraph. If anyone (including future me) loosens that check, the test will object.
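That contract test can be sketched in a few lines. This is a hedged stand-in, not the real test file: `generate_title` here skips the LLM call and only reproduces the validation step, and the 60-character cap is an assumed value for `TITLE_MAX_CHARS`:

```python
def generate_title(raw: str, max_chars: int = 60) -> str:
    """Stand-in for the validation step (the real function calls the LLM first)."""
    title = raw.replace("*", " ").strip()
    if len(title) >= max_chars:
        raise ValueError(f"Generated title is {len(title)} chars (must be < {max_chars}): {title!r}")
    return title

def test_rejects_essay_masquerading_as_title():
    # stand-in for the multi-paragraph Chinese clipboard essay that triggered the bug
    essay = "这是一段很长的中文文章，" * 20
    try:
        generate_title(essay)
    except ValueError:
        return  # contract upheld: fail loudly rather than return a paragraph
    raise AssertionError("generate_title returned a paragraph-length title")
```

The test name, not the input, is what carries the documentation value.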
The broader pattern, which I want to apply more aggressively in code that touches LLMs: every value that crosses from “model output” into “structured format” gets a validator. Title strings get a length cap. JSON responses get a schema check. Filenames get a regex. Code outputs get a parse. The cost is a few lines per boundary; the saving is not waking up to a broken Jekyll build because a remote model had an off day.
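As a sketch of what those per-boundary validators can look like — the names, the required-keys interface, and the filename policy are all illustrative assumptions, not the tool's actual code:

```python
import json
import re

# hypothetical filename policy: lowercase slug, bounded length, no leading dot
FILENAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{0,100}$")

def validate_json_response(raw: str, required_keys: set[str]) -> dict:
    """Parse model output as JSON and insist on the expected shape."""
    data = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(data, dict):
        raise ValueError(f"Expected a JSON object, got {type(data).__name__}")
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"Response missing keys: {sorted(missing)}")
    return data

def validate_filename(name: str) -> str:
    """Refuse model-produced filenames that don't match the policy."""
    if not FILENAME_RE.fullmatch(name):
        raise ValueError(f"Model produced an unsafe filename: {name!r}")
    return name
```

Each validator is a few lines, raises at the boundary, and puts the offending value in the message — the same shape as the title fix.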