Frontier LLMs Corrupt 25% of Documents in Long Workflows, New Benchmark Shows

Even the best LLMs corrupt roughly 25% of document content in long workflows. Here's what that means for your pipelines.

A new paper from researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville lands a punch that anyone building document automation pipelines needs to read: even the best models available today quietly destroy roughly a quarter of your document content by the time a long workflow finishes. Not sometimes. On average.

The paper, submitted April 17, 2026, introduces the DELEGATE-52 benchmark and tests 19 LLMs across 52 professional domains. The results are a problem for anyone who has started treating LLMs as reliable autonomous editors.

What "Delegation" Actually Means

Delegation, in this context, means handing a model a document and a sequence of tasks, then walking away. Think vibe coding where you ask an LLM to iteratively refactor a codebase. Or a data pipeline where a model cleans and reformats files across multiple steps. Or a writing assistant that edits a report through several rounds of feedback.

The implicit contract in all of these is trust: you expect the model to do what you asked without silently breaking what you didn't ask it to touch. That contract, according to this research, is not being honored.

"Delegation requires trust — the expectation that the LLM will faithfully execute the task without introducing errors into documents."

The DELEGATE-52 Benchmark

DELEGATE-52 is purpose-built to stress-test this exact problem. It simulates long, multi-step document editing workflows across 52 professional domains, each with domain-specific document formats and constraints. The domains range from coding and crystallography to music notation, so this isn't just testing prose editing. It's testing whether models can handle structured, high-stakes documents where a single corrupted field matters.

The benchmark ran all 19 tested models through iterative editing tasks and tracked how document quality degraded across the interaction sequence. It also tested tool-augmented agents, the kind of setup often pitched as more reliable than vanilla prompting.

What the Results Show

The headline number: frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows. Non-frontier models do worse. There's a clear performance gap between tiers, but the gap doesn't matter much in practice because even the best-in-class models are failing at a rate that would be unacceptable in any real production system.

The authors state the core failure mode bluntly:

"Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."

Why Silent Corruption Is the Real Problem

The "silently" part is what makes this dangerous in practice. If a model returned an error or flagged uncertainty, you could build a review step. But corruption that looks like valid output is much harder to catch, especially across a long pipeline with many steps. By the time you notice something is wrong, you may be several transformations removed from where the damage started.
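If you do build that review step, one cheap mechanical check is to diff the edited document against the original and flag any change outside the region the task actually covered. A minimal sketch, assuming a convention where the caller passes the 0-indexed original line numbers the task was allowed to touch (the helper and its interface are ours, not part of DELEGATE-52):

```python
import difflib

def out_of_scope_changes(original: str, edited: str, requested: set[int]) -> list[str]:
    """Return diff hunks that touch lines the model was NOT asked to edit.

    `requested` holds 0-indexed line numbers of the original document that
    the task explicitly covers; everything else must survive byte-for-byte.
    (Illustrative helper -- not from the paper or the benchmark.)
    """
    sm = difflib.SequenceMatcher(a=original.splitlines(),
                                 b=edited.splitlines())
    violations = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        # Pure insertions span no original lines; anchor them at i1.
        touched = set(range(i1, i2)) or {i1}
        if not touched <= requested:
            violations.append(f"{tag} at original lines {sorted(touched)}")
    return violations
```

An empty return means every change stayed inside the requested region; anything else is exactly the kind of silent, out-of-scope edit the paper is describing, caught one step after it happens instead of five transformations later.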

The paper notes that degradation gets worse as document size grows, as the number of interactions increases, and as the context gets cluttered with distractor files. These are all normal conditions in real workflows, not edge cases. A long coding session, a multi-chapter document, a folder of related files: all of these push you toward higher corruption rates.
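The compounding is worth making concrete. The paper doesn't report a per-step rate, but as an illustration: under a toy model where each edit independently corrupts a fixed fraction of the surviving content, even a modest 1.5% loss per edit pushes total corruption past the 25% mark within about 20 steps.

```python
def intact_fraction(per_step_corruption: float, steps: int) -> float:
    """Expected fraction of content still intact after `steps` edits,
    assuming each edit independently corrupts a fixed fraction of the
    surviving content (an illustrative model, not the paper's)."""
    return (1 - per_step_corruption) ** steps

# Small per-edit damage compounds quickly over a long workflow.
for steps in (1, 5, 10, 20):
    corrupted = 1 - intact_fraction(0.015, steps)
    print(f"{steps:>2} steps -> {corrupted:.1%} corrupted")
```

The point isn't the exact numbers; it's that per-step error rates that look negligible in a demo become dominant over the workflow lengths DELEGATE-52 simulates.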

What This Means for Developers

If you're using LLMs for short, single-step document tasks with human review before anything goes anywhere, you're probably fine. The risk profile here is specifically about long, iterative, delegated workflows where the model is operating autonomously across many steps.

A few practical takeaways:

- Don't trust long, unattended edit chains. Insert validation or human-review checkpoints between steps, not just at the end.
- Diff each output against its input. Silent corruption is easiest to catch one transformation after it happens, not five.
- Keep the model's context lean. Larger documents, longer interaction histories, and distractor files all worsened degradation in the benchmark.
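One concrete shape for a between-step checkpoint: wrap each autonomous edit so the result is kept only if it still validates, and roll back otherwise. A sketch with hypothetical names (`edit_fn` stands in for the LLM call; none of this comes from the paper):

```python
import json

def guarded_edit(doc: str, edit_fn, validate) -> str:
    """Apply a model edit, keeping it only if the result still validates.

    `edit_fn` stands in for an LLM call; `validate` raises on a
    structurally broken document. (Hypothetical wrapper, not the
    paper's method.)
    """
    candidate = edit_fn(doc)
    try:
        validate(candidate)
    except Exception:
        return doc  # roll back: a refused edit beats silent corruption
    return candidate

# Example validator: a JSON config edited across many steps must stay
# parseable and keep its required keys.
def validate_config(text: str) -> None:
    data = json.loads(text)
    missing = {"name", "version"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
```

Rolling back is a blunt instrument, but it converts silent corruption into a visible failure you can route to a human, which is the property the benchmark shows current models lack on their own.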

The 52-domain scope of DELEGATE-52 also matters: this isn't a narrow finding about one document type. Crystallography files and music notation are very different from Python code, but all showed the same fundamental degradation pattern. That suggests the problem is in how models handle iterative editing generally, not in something specific to one format.

Bottom Line

Frontier LLMs are not ready for unsupervised, long-horizon document editing in production. A 25% corruption rate across the best available models is not a rounding error; it's a fundamental reliability gap. If your architecture depends on delegating multi-step document work to a model and trusting the output, this research is a strong signal to add validation layers before you ship. DELEGATE-52 is worth tracking as new models release.
