A new paper from researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville delivers a finding that anyone building document automation pipelines needs to read: even the best models available today quietly destroy roughly a quarter of your document content by the time a long workflow finishes. Not sometimes. On average.
The paper, submitted April 17, 2026, introduces the DELEGATE-52 benchmark and tests 19 LLMs across 52 professional domains. The results are a problem for anyone who has started treating LLMs as reliable autonomous editors.
What "Delegation" Actually Means
Delegation, in this context, means handing a model a document and a sequence of tasks, then walking away. Think vibe coding, where you ask an LLM to iteratively refactor a codebase. Or a data pipeline where a model cleans and reformats files across multiple steps. Or a writing assistant that edits a report through several rounds of feedback.
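Concretely, the pattern under test looks something like the loop below. This is a minimal sketch, not the paper's harness: `llm_edit` stands in for whatever model call you actually make, and the names are illustrative.

```python
from typing import Callable

def delegate(document: str, tasks: list[str],
             llm_edit: Callable[[str, str], str]) -> str:
    """Run a sequence of editing tasks with no intermediate checks:
    the trust-based pattern the paper stress-tests."""
    for task in tasks:
        # Each step rewrites the whole document. Nothing here verifies
        # that content outside the task's scope survived intact.
        document = llm_edit(document, task)
    return document
```

Every pass through that loop is a chance for the model to touch something it wasn't asked to touch, and nothing in the loop would notice.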
The implicit contract in all of these is trust: you expect the model to do what you asked without silently breaking what you didn't ask it to touch. That contract, according to this research, is not being honored.
"Delegation requires trust — the expectation that the LLM will faithfully execute the task without introducing errors into documents."
The DELEGATE-52 Benchmark
DELEGATE-52 is purpose-built to stress-test this exact problem. It simulates long, multi-step document editing workflows across 52 professional domains, each with domain-specific document formats and constraints. The domains range from coding and crystallography to music notation, so this isn't just testing prose editing. It's testing whether models can handle structured, high-stakes documents where a single corrupted field matters.
The benchmark ran all 19 tested models through iterative editing tasks and tracked how document quality degraded across the interaction sequence. It also tested tool-augmented agents, the kind of setup often pitched as more reliable than vanilla prompting.
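The paper's scoring is domain-specific, but you can log a similar per-step signal in your own pipelines. Here is one crude line-level proxy; it's an assumption about how you might track degradation, not the benchmark's actual metric.

```python
import difflib

def intact_fraction(original: str, edited: str) -> float:
    """Fraction of the original document's lines that survive unchanged.
    Requested edits also count as change, so watch for sudden drops
    between steps rather than reading the absolute number."""
    orig_lines = original.splitlines()
    matcher = difflib.SequenceMatcher(None, orig_lines,
                                      edited.splitlines())
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / max(len(orig_lines), 1)
```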
What the Results Show
The headline number: frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows. Non-frontier models do worse. There's a clear performance gap between tiers, but the gap doesn't matter much in practice because even the best-in-class models are failing at a rate that would be unacceptable in any real production system.
A few other findings worth paying attention to:
- Agentic tool use offers no improvement. If you were planning to solve this with a more elaborate agent setup, the data says don't count on it.
- Larger documents make things worse. Longer interaction sequences make things worse. Distractor files in the workflow make things worse still.
- The errors are sparse but severe, and they accumulate silently. You won't necessarily see them happening.
"Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."
Why Silent Corruption Is the Real Problem
The "silently" part is what makes this dangerous in practice. If a model returned an error or flagged uncertainty, you could build a review step. But corruption that looks like valid output is much harder to catch, especially across a long pipeline with many steps. By the time you notice something is wrong, you may be several transformations removed from where the damage started.
The paper notes that degradation gets worse as document size grows, as the number of interactions increases, and as the context gets cluttered with distractor files. These are all normal conditions in real workflows, not edge cases. A long coding session, a multi-chapter document, a folder of related files: all of these push you toward higher corruption rates.
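One practical response is to stop treating the workflow as a black box. If you retain every intermediate version, you can bisect back to the step where a known-good invariant first broke. A minimal sketch, with `llm_edit` again standing in for your model call:

```python
def run_with_checkpoints(document, tasks, llm_edit):
    """Run the workflow, retaining every intermediate version."""
    versions = [document]
    for task in tasks:
        document = llm_edit(document, task)
        versions.append(document)
    return versions

def first_broken_step(versions, must_survive):
    """Index of the first version missing a string that should never
    change (a license header, a key data field), or None if intact."""
    for step, doc in enumerate(versions):
        if must_survive not in doc:
            return step
    return None
```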
What This Means for Developers
If you're using LLMs for short, single-step document tasks with human review before anything goes anywhere, you're probably fine. The risk profile here is specifically about long, iterative, delegated workflows where the model is operating autonomously across many steps.
A few practical takeaways:
- Don't trust LLMs as autonomous editors on high-stakes documents without checkpoints. Build in human review or automated validation at each major step, not just at the end (see the sketch after this list).
- Distractor files in context are a real risk factor. Keep the model's working context as clean and focused as possible.
- Tool-augmented agents are not a workaround for this problem, at least not based on current evidence. Adding tools doesn't improve delegation reliability.
- Shorter, more targeted interactions are safer than long, multi-step sessions. If you can break a workflow into smaller discrete tasks with validation between them, do it.
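Putting the first and last points together, a checkpointed version of the delegation loop might look like the sketch below. The `validate` callable is any cheap invariant check you can write for your document type (does it still parse, are required fields present); the helper names here are hypothetical.

```python
import json

def checked_delegate(document, tasks, llm_edit, validate):
    """Delegation with a gate after every step. An edit that breaks an
    invariant is rejected before corruption compounds into later steps."""
    for task in tasks:
        candidate = llm_edit(document, task)
        if not validate(candidate):
            raise ValueError(f"edit for task {task!r} failed validation")
        document = candidate
    return document

# Example invariant for a JSON document (assumes a top-level object):
# it must still parse and keep all of its original top-level keys.
def keys_stable(required_keys: set):
    def validate(doc: str) -> bool:
        try:
            return required_keys <= set(json.loads(doc))
        except json.JSONDecodeError:
            return False
    return validate
```

Per-step validation doesn't make the model more reliable; it just converts silent corruption into a loud failure you can act on.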
The 52-domain scope of DELEGATE-52 also matters: this isn't a narrow finding about one document type. Crystallography files and music notation are very different from Python code, but all showed the same fundamental degradation pattern. That suggests the problem is in how models handle iterative editing generally, not in something specific to one format.
Bottom Line
Frontier LLMs are not ready for unsupervised, long-horizon document editing in production. A 25% corruption rate across the best available models is not a rounding error; it's a fundamental reliability gap. If your architecture depends on delegating multi-step document work to a model and trusting the output, this research is a strong signal to add validation layers before you ship. DELEGATE-52 itself is worth tracking as new models release.