Why AI tools rewrite full files instead of using diffs

Part of the AI-Assisted Web Apps series. Post 2 of 7.

If you have used Cursor, Aider, or Morph’s apply API for any non-trivial edit, you have probably noticed that the tool prefers to rewrite an entire file instead of generating a unified diff. The behaviour seems wasteful at first (more tokens, more time to read the response, more bytes on the wire), but it is the empirically correct choice. After Aider’s team published issue #625 in 2024 with the now-famous finding that “fully rewriting the full file outperforms aider-like diffs for files under 400 lines,” the rest of the industry quietly converged on the same architecture. Here is what is actually going on.

The empirical finding

Three independent investigations reached the same conclusion within twelve months of each other:

Aider issue #625 (June 2024). Paul Gauthier benchmarked search-and-replace blocks (Aider’s native format) against full-file rewrites on a controlled edit task and found rewrites had a measurably higher success rate under 400 lines.
Cursor’s instant-apply blog post (October 2024). The Cursor team described their “fast apply” architecture: instead of asking the frontier model to emit a diff, they have it emit the change in structured form, then run a small fine-tuned model that rewrites the file at roughly 1,000 tokens per second.
Morph’s fast-apply paper (2025). Morph generalized Cursor’s pattern into a hosted API: any agent that currently asks a frontier model to rewrite the file can swap in Morph and cut token usage 50–60% and latency 90% or more, while keeping the rewrite-style architecture.

The convergence is striking. Three different teams, three different starting points, same conclusion: under 400 lines, rewriting beats diffing. There are three reasons: two rooted in how language models work, and one in cost.

Reason 1: training data distribution

Code on GitHub is overwhelmingly stored as complete files, not as patch hunks. When a language model is pre-trained on a large code corpus, the ratio of full-file examples to valid-diff examples is something like 1,000:1.

This shapes the model’s priors. A model asked to emit a complete file is operating in a region of the distribution it has seen millions of times. A model asked to emit a unified diff is operating in a region it has seen thousands of times, mostly in commits and blog posts, almost never as the primary output of a long synthesis task.

You can see this in failure-mode statistics. Fabian Hertwig’s analysis and Morph’s diff-format breakdown both put unified-diff success rates at 70–80% on complex files, with the gap widening as files grow. Full-file rewrites in the same conditions clear 95%+. The model is not “bad at diffs” in some abstract sense; it has simply seen far fewer of them, and far fewer of them under task pressure.

Reason 2: tokenizer behaviour and unforgiving syntax

Unified diff format has a rigid structure. Every line starts with one of four characters (+, -, a space, or @). Hunk headers have the form @@ -old_start,old_count +new_start,new_count @@. Context lines must match the source file character-for-character, including whitespace.
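To make the format concrete, here is a minimal sketch that uses Python's standard difflib module to produce a unified diff for a three-line file. The file name and contents are invented for illustration; the point is only to show the hunk header and the one-character line prefixes.

```python
import difflib

# Hypothetical before/after contents of a tiny file (illustrative only).
original = [
    "def greet(name):\n",
    "    print('Hello, ' + name)\n",
    "    return None\n",
]
edited = [
    "def greet(name):\n",
    "    print(f'Hello, {name}!')\n",
    "    return None\n",
]

diff_lines = list(
    difflib.unified_diff(original, edited, fromfile="a/greet.py", tofile="b/greet.py")
)
print("".join(diff_lines), end="")

# Locate the hunk header, then collect the first character of every body line.
hunk_start = next(i for i, line in enumerate(diff_lines) if line.startswith("@@"))
hunk_header = diff_lines[hunk_start]
body_prefixes = {line[0] for line in diff_lines[hunk_start + 1:]}
```

Every body line carries a prefix of a space, a minus, or a plus in front of the real indentation, which is exactly the extra layer a model has to get right on top of the code itself.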

This unforgiveness collides with how language models tokenize code. The byte-pair-encoding tokenizers used by Claude, GPT, and Llama do not treat leading whitespace as a first-class element; they fold it into the next code token. So when a model is asked to emit a diff hunk, it has to coordinate three things at once: the prefix character (+, -, or a space), the indentation, and the actual code. A frontier model that drifts by even one space, or merges two whitespace tokens incorrectly, produces a diff that fails to apply at all.

Full-file rewrites have none of this. The model emits code, the tool writes the file, the next read-back is the source of truth. There is no syntactic frame outside the language itself. Errors that would silently break a diff just produce slightly different code that the test suite catches.
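The one-space failure is easy to reproduce. The sketch below is a deliberately strict, hypothetical context check (not the logic of any particular patch tool): it strips the prefix character from context and removed lines and compares them to the file verbatim. The function name and sample file are assumptions made for illustration.

```python
def context_matches(file_lines, start, hunk_lines):
    """Return True if the hunk's context and removed lines match the file
    exactly, beginning at 0-based index `start`."""
    # Context (' ') and removed ('-') lines must already exist in the file.
    expected = [line[1:] for line in hunk_lines if line[:1] in (" ", "-")]
    actual = file_lines[start:start + len(expected)]
    return expected == actual


source = [
    "def total(items):",
    "    result = 0",
    "    for item in items:",
    "        result += item",
    "    return result",
]

# A correct hunk: prefix character followed by the exact original indentation.
good_hunk = [
    " " + "    for item in items:",
    "-" + "        result += item",
    "+" + "        result += item.price",
    " " + "    return result",
]

# The same hunk with one space dropped from the first context line.
drifted_hunk = [
    " " + "   for item in items:",
    "-" + "        result += item",
    "+" + "        result += item.price",
    " " + "    return result",
]

good_ok = context_matches(source, 2, good_hunk)
drifted_ok = context_matches(source, 2, drifted_hunk)
print(good_ok, drifted_ok)
```

Real patch tools often add fuzzy matching to tolerate some drift, but that trades a hard failure for the risk of applying the change in the wrong place.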

Reason 3: decoding cost under 400 lines

The third reason is economic. Under 400 lines, a frontier model rewriting the full file emits maybe 4,000–6,000 tokens. At Claude Sonnet pricing that is something like 3–5 cents and 5–6 seconds. The diff alternative would be 200–500 tokens at maybe 0.5–1 second, but with a 20–30% chance of failing to apply.

Run the math: if 25% of diffs fail, the expected generation cost of a diff-based edit is diff_cost + 0.25 × retry_cost. If a retry costs about the same as the original generation, that is 1.25x the diff cost when one retry is enough, or about 1.33x if retries fail at the same rate. On top of that you pay the wall-clock cost of the retry and the human-in-the-loop cost of noticing the silent failure. Under 400 lines, once those are counted, the expected cost is comparable to or worse than just rewriting the file once.
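The retry arithmetic can be sketched in a few lines. This assumes every attempt costs the same and fails independently with the same probability, which is a simplification; the function names are illustrative, and the sketch covers generation cost only, not wall-clock or human review time.

```python
def expected_attempts(failure_rate: float) -> float:
    """Expected number of generations until one applies, assuming independent
    attempts with a constant failure rate (geometric distribution)."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)


def expected_cost_single_retry(cost_per_attempt: float, failure_rate: float) -> float:
    """Expected cost if a failed attempt is retried once and the retry always succeeds."""
    return cost_per_attempt * (1.0 + failure_rate)


multiplier_single = expected_cost_single_retry(1.0, 0.25)  # one retry is enough
multiplier_repeated = expected_attempts(0.25)              # retries can fail too
print(multiplier_single, round(multiplier_repeated, 2))
```

The token multiplier alone is modest; what it leaves out is the part that hurts in practice, namely the time spent noticing that an edit did not land.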

This is why Cursor’s speculative-edits algorithm exists. It uses the original file as the deterministic draft for speculative decoding, so the small fast-apply model only has to verify tokens for unchanged regions instead of regenerating them. The result is a full-file rewrite at roughly diff-token speed, which dominates both formats simultaneously.

Where the math flips above 400 lines

Above 400 lines, the same calculation runs the other way. A 2,000-line file is 16,000–24,000 output tokens to rewrite. At frontier pricing that is 30–50 cents and 30–60 seconds, every time you change a comment.
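As a rough sketch of how the rewrite cost scales, the helper below multiplies line count by an assumed 8 to 12 output tokens per line, which is the ratio implied by the 2,000-line figure above. Real files vary by language and style, so treat the range as an assumption, not a measurement.

```python
def rewrite_output_tokens(line_count: int, tokens_per_line=(8, 12)):
    """Estimate the (low, high) output tokens needed to re-emit a whole file.

    `tokens_per_line` is an assumed range, not a measured constant.
    """
    low, high = tokens_per_line
    return line_count * low, line_count * high


large_file_estimate = rewrite_output_tokens(2000)
print(large_file_estimate)
```

The cost is linear in file length and independent of how small the change is, which is the whole problem with rewrites on large files.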

Diff format does not have this scaling problem. A small change to a 2,000-line file is still a small diff. The 70–80% success rate stays roughly the same as for smaller files; it is primarily a function of file complexity, not file size. So for large files, diff cost is much lower while the failure rate is about the same, and diff format starts to dominate again.

This is why Aider, Claude Code, and most CLI-based agents still support diff-style edits as a fallback for large files, and why Cursor’s fast-apply documentation specifically calls out that the technique is bounded to small files. There is no globally-best edit format. There is a file-size threshold, and 400 lines is a working approximation of where it sits.
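A size-based switch can be sketched as a single function. This is a hypothetical illustration of the threshold idea, not the selection logic of Aider, Cursor, or Claude Code; the function name, the format labels, and the 400-line default are assumptions taken from the working number above.

```python
def choose_edit_format(file_text: str, threshold_lines: int = 400) -> str:
    """Pick an edit format from file length alone.

    Returns "rewrite" for files under the threshold and "diff" otherwise.
    """
    line_count = len(file_text.splitlines())
    return "rewrite" if line_count < threshold_lines else "diff"


small_component = "x = 1\n" * 120     # stand-in for a typical component file
large_module = "x = 1\n" * 2000       # stand-in for a monolithic file

small_choice = choose_edit_format(small_component)
large_choice = choose_edit_format(large_module)
print(small_choice, large_choice)
```

A production version would likely weigh more than line count (language, how localized the change is, model in use), but the threshold is the part the evidence above supports.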

What this means for tool choice

Three practical takeaways:

For most app-development work, full-file rewrites are the right default. Component files, route handlers, schema definitions, and most utility modules are well under 400 lines, and the rewrite-style edit is more reliable.
For large monolithic files, choose tools that switch formats. Aider explicitly switches between rewrite, search-and-replace, and udiff based on file size. Claude Code’s edit tool does similar logic internally. Tools that always use one format will be wrong above or below the threshold.
Treat “applied 0 of N edits” as a bug, not a no-op. If your AI tool reports that it applied none, or only some, of a diff’s edits, the model thought there was an edit to make but the apply step couldn’t find it. That is a silent failure, and it is the most common cause of “the AI said it fixed it but the bug is still there.”
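If you wrap your own tooling around an edit step, the third takeaway can be enforced in code. The sketch below assumes a hypothetical ApplyResult record with applied and total counts; no real tool's API is implied, and both class names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ApplyResult:
    """Hypothetical summary of an apply step: how many edits landed out of how many."""
    applied: int
    total: int


class EditApplyError(Exception):
    """Raised when an apply step did not land every edit it was given."""


def require_full_application(result: ApplyResult) -> None:
    """Fail loudly unless every edit was applied."""
    if result.applied < result.total:
        raise EditApplyError(f"applied {result.applied} of {result.total} edits")


require_full_application(ApplyResult(applied=4, total=4))  # passes silently

try:
    require_full_application(ApplyResult(applied=0, total=4))
    failure_message = None
except EditApplyError as exc:
    failure_message = str(exc)

print(failure_message)
```

Raising instead of logging means a partially applied edit stops the pipeline before anyone assumes the fix is in place.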

If you are running a production codebase with mixed-size files, the practical heuristic is to know which format your AI tool is using on each edit, and to match the file size to the format. Tools that hide the choice from you will be wrong roughly a quarter of the time on edits at the boundary, and you will not know which quarter unless you read every diff.

The full series this post is part of: How AI coding tools actually edit code: 6 truths from 4,000+ hours of building software.
What happens when an AI tool’s edit format collides with a real-world platform shift: Migrating to Supabase publishable keys broke my Chrome extension.


Frequently asked questions

Why don't AI coding tools use unified diff format?

Unified diff format reaches only 70-80% accuracy on complex files. The failure modes are predictable: incorrect hunk line numbers (the LLM gets the @@ -old +new @@ headers wrong), context drift (the file has changed since the LLM last saw it), and anchor matching failures (the patch context matches zero locations or multiple ambiguous locations in the file). Whitespace changes, comment additions, or formatter runs between generation and application can also break a diff that was syntactically correct when emitted. Full-file rewrites sidestep these failure modes by handing the model the entire file as context and trusting it to reproduce the unchanged parts verbatim.

What is Cursor's fast apply?

Cursor's fast apply is a two-model architecture for code edits. A frontier model (Claude, GPT, etc.) generates the change in structured form, then a smaller fine-tuned model rewrites the entire file with the change applied at roughly 1,000 tokens per second. The key technique is "speculative edits," a custom speculative-decoding algorithm that uses the original file as the deterministic draft, so the small model only has to verify tokens for unchanged regions instead of regenerating them. The end result is functionally equivalent to a full-file rewrite but up to 9x faster than a frontier-model rewrite, with comparable accuracy.

When do AI-generated diffs win over full-file rewrites?

Around 400 lines is the empirically observed boundary in Aider's benchmarks (issue #625) and Cursor's internal testing. Below that, full-file rewrites have a higher success rate and the rewrite token cost is small enough that the accuracy gain dominates. Above 400 lines, rewrite cost grows linearly while diff success rate stays roughly constant, so diff format wins on token economics even though its application failure rate is higher. The exact threshold varies by language and file structure, but 400 is a reasonable working number.

What is Morph's fast apply API?

Morph is a hosted "fast apply" model: a specialized small model trained specifically on the merge-an-edit-into-a-file task. The pitch is that any agent currently asking a frontier model to rewrite the full file can swap in Morph's API and cut token usage by 50-60% and latency by 90% or more. A 1,000-line file takes 1.3 seconds through Morph versus 10-12 seconds through Claude Sonnet doing a full rewrite. The accuracy stays in the same range as full-file rewrites because the architecture is the same; the specialization is in decoding speed and cost, not in the format.

How does Aider apply edits, and why is it different?

Aider uses search-and-replace blocks: the LLM emits a SEARCH section (the exact text to find) and a REPLACE section (the new text), with the surrounding context as the anchor. Aider then matches the SEARCH section against the file using a cascade of strategies (exact match, anchor-based, string similarity, Levenshtein distance) and applies the REPLACE if the confidence score passes. This avoids the line-number drift problem of unified diff format but inherits the context-drift problem when the file has changed since the LLM saw it. Aider's own benchmarks (issue #625) showed full-file rewrites beat this format under 400 lines, which is why most newer tools have moved that direction.
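For intuition, here is a minimal sketch of the first stage only: an exact-match search-and-replace that refuses to guess. It is not Aider's implementation, and the function name is hypothetical; it simply shows why a SEARCH block that matches zero or several locations has to be treated as a failure.

```python
def apply_search_replace(text: str, search: str, replace: str) -> str:
    """Apply one SEARCH/REPLACE block using exact matching only.

    Raises ValueError if the SEARCH text is empty, missing, or ambiguous.
    """
    if not search:
        raise ValueError("SEARCH text is empty")
    matches = text.count(search)
    if matches == 0:
        raise ValueError("SEARCH text not found in file")
    if matches > 1:
        raise ValueError(f"SEARCH text is ambiguous ({matches} matches)")
    return text.replace(search, replace, 1)


source = "def area(w, h):\n    return w + h\n"
patched = apply_search_replace(source, "return w + h", "return w * h")
print(patched, end="")

# A stale SEARCH block (the file changed since the model saw it) fails loudly.
try:
    apply_search_replace(patched, "return w + h", "return w * h")
    stale_error = None
except ValueError as exc:
    stale_error = str(exc)
```

Fuzzier fallback strategies raise the hit rate, at the price of sometimes matching the wrong region, which is why a confidence threshold matters.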

Should I worry about which edit format my AI coding tool uses?

Yes, for two reasons. First, “applied 0 of 4 edits” is a real failure mode: some AI tools report it as success while leaving your code unchanged or half-applied. If your tool uses a diff format, watch for this and treat it as a bug. Second, edit format affects cost and latency: full-file rewrites consume more output tokens than diffs, so on large files (>400 lines), diff-based tools will be cheaper and faster, but with a higher probability of partial-application bugs. Pick the format that matches your file sizes and your tolerance for silent failures.
