Article · Jan 9, 2026

How AI coding tools actually edit code: 6 truths from 4,000+ hours of building software

After 4,000+ hours of building software, the most recent of them with Cursor, Claude Code, Aider, and Lovable, here is how AI coding tools actually edit code under the hood.

If you have used Cursor, Claude Code, Aider, or Lovable for any serious work, you’ve probably noticed the tools edit code in ways that surprised you. A change you expected to be a one-line patch comes back as a full file rewrite. A search-and-replace operation reports success and silently misses three of the four matches. A repo with two hundred files somehow gets the right context loaded for a question that mentions only one of them. After 4,000+ hours of building software (the bulk of that on production Bubble.io apps, with the last stretch on AI-assisted code across client work, my own projects, and a long string of failed experiments), here is what I have learned about how AI coding tools actually edit code under the hood. Six truths, with citations to the papers and benchmarks where I learned each one.

1. AI coding tools have two brains

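Many production AI coding tools are not one model. They are two: a frontier-class model that does the reasoning, and a smaller, faster fine-tuned model that does the application. Cursor calls this fast apply. Morph sells it as an API. Not every tool works this way: Claude Code’s edit tool, as publicly described, performs exact string replacement rather than a second-model rewrite, so treat the two-model split as a common pattern rather than a universal one.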

The reason for the split is decoding speed. A frontier model generates around 50 to 200 tokens per second. For a tab-complete or an inline edit, that can feel laggy but is acceptable. For a multi-file edit where the model has to regenerate four files of 300 lines each, it is too slow to keep a developer in flow. The fine-tuned smaller model, trained specifically on the rewrite task, can hit thousands of tokens per second on the same hardware.
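To put rough numbers on that, here is a back-of-envelope sketch. The figure of 10 tokens per line is my own assumption, not a measurement, and the two throughput values are simply picked from the ranges quoted above.

```python
# Back-of-envelope decoding latency for a four-file, 300-lines-each rewrite.
# TOKENS_PER_LINE is an assumed average, not a measured figure.
TOKENS_PER_LINE = 10


def rewrite_seconds(files: int, lines_per_file: int, tokens_per_second: float) -> float:
    """Time to regenerate every line of every file at a given decode speed."""
    total_tokens = files * lines_per_file * TOKENS_PER_LINE
    return total_tokens / tokens_per_second


# 4 files x 300 lines x 10 tokens = 12,000 tokens in both cases.
frontier = rewrite_seconds(files=4, lines_per_file=300, tokens_per_second=100)
fast_apply = rewrite_seconds(files=4, lines_per_file=300, tokens_per_second=2000)

print(f"frontier model: {frontier:.0f} s")      # 12,000 / 100 = 120 s
print(f"fast apply model: {fast_apply:.0f} s")  # 12,000 / 2,000 = 6 s
```

Under those assumptions the same rewrite drops from about two minutes to a few seconds, which is the difference between leaving the editor and staying in it.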

The frontier model decides what to change. The fast model writes it. You see one suggestion, but two models cooperated to make it.

This is why your edits sometimes feel inconsistent. The frontier model and the fast model can disagree at the edges, and the fast model is the one that wins because it produces the final bytes. Most of the “the AI lost my comment” or “it reverted my variable name” complaints I see online are this seam showing.

2. AI tools rewrite full files instead of generating diffs

This was the most counterintuitive finding for me, and the one with the strongest empirical evidence behind it. AI coding tools, given a choice between emitting a unified diff and rewriting the entire file, will overwhelmingly choose to rewrite. And for files under about 400 lines, the rewrite has a higher success rate than the diff.

The data behind this comes from Aider’s issue #625, where the team measured edit-application success rates across both formats. Cursor’s instant-apply pattern productized the same finding. The reasons are three, all rooted in how language models work:

  • Training data distribution. Code on GitHub is stored as full files, not as patch hunks. The model has seen orders of magnitude more complete files than valid diffs during training, so its priors are calibrated for one and not the other.
  • Tokenizer behaviour. Diff format has unforgiving structure: @@ -12,7 +12,8 @@ line markers, exact-whitespace context lines, + and - prefixes that have to be in the right column. A frontier model that drifts by one character produces a diff that fails to apply. Full-file rewrites are forgiving by comparison.
  • Decoding cost. Under 400 lines, the marginal cost of rewriting the full file is small, and the success-rate gain dominates. Above 400 lines, rewrite cost grows linearly while diff cost stays roughly flat, so diffs start to win again. This is also why Cursor’s fast apply is bounded to small files.
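The brittleness of diff format is easy to reproduce. The sketch below is a deliberately strict, simplified hunk applier of my own (`apply_hunk` and `HunkError` are hypothetical names, and real patch tools are more lenient about offsets and fuzz). It shows a hunk that is off by one space of indentation failing to apply.

```python
class HunkError(Exception):
    """Raised when a hunk's context or removed lines do not match the file."""


def apply_hunk(lines: list[str], start: int, hunk: list[str]) -> list[str]:
    """Apply one unified-diff-style hunk with exact matching.

    `start` is the 1-based line where the hunk begins. Each hunk entry starts
    with ' ' (context), '-' (remove), or '+' (add).
    """
    out = lines[: start - 1]
    pos = start - 1
    for entry in hunk:
        tag, text = entry[:1], entry[1:]
        if tag in (" ", "-"):
            # Context and removed lines must match the file character for character.
            if pos >= len(lines) or lines[pos] != text:
                raise HunkError(f"line {pos + 1} does not match {text!r}")
            if tag == " ":
                out.append(text)
            pos += 1
        elif tag == "+":
            out.append(text)
        else:
            raise HunkError(f"bad prefix in {entry!r}")
    return out + lines[pos:]


source = ["def total(items):", "    return sum(items)"]

good = [" def total(items):", "-    return sum(items)", "+    return sum(items, 0)"]
# Same hunk, but the removed line was generated with three spaces instead of four.
drifted = [" def total(items):", "-   return sum(items)", "+    return sum(items, 0)"]

print(apply_hunk(source, 1, good))
try:
    apply_hunk(source, 1, drifted)
except HunkError as err:
    print("hunk rejected:", err)
```

A full-file rewrite has no equivalent failure: whatever the model emits becomes the file, so a one-character drift shows up as a visible change instead of a rejected patch.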

When you understand this, the seemingly random behaviour of AI tools makes sense. They are not deciding diff vs rewrite based on “what’s cleaner.” They are picking the format with the higher empirical success rate for the file size, which is almost always rewrite.

3. Search-and-replace fails silently

If your AI tool’s only edit primitive is “find this exact string and replace it with that exact string,” you are one whitespace change away from a silent failure.

This is the failure mode community discussions like GitHub Copilot’s #152226 keep surfacing. Larger files contain more potential pattern matches. Context boundaries become harder to determine as code evolves. Comments get added. Whitespace shifts. The AI’s “find” string was correct when generated, but no longer matches anything in the file by the time the patch reaches the disk.

The worst part is that the tool reports success. It says “applied 4 edits” and moves on. You discover the missed edits when the test fails or the bug shows up in production.

The reliable AI tools either (a) use rewrite-style edits with the small fast model handling the application, or (b) check the apply result against the model’s intent and re-prompt if the diff applied to fewer hunks than expected. Tools that do neither are a footgun in any real codebase.
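Option (b) is cheap to approximate even outside the tool. As a minimal sketch, assuming a plain exact-string edit primitive, the wrapper below refuses to report success unless the number of matches equals the number the caller expected. `replace_checked` and `EditMismatch` are illustrative names of mine, not any tool's API.

```python
class EditMismatch(Exception):
    """Raised when a find string matches a different number of times than expected."""


def replace_checked(text: str, find: str, replace: str, expected: int = 1) -> str:
    """Exact-string replace that fails loudly instead of silently doing nothing."""
    if not find:
        raise ValueError("find string must not be empty")
    found = text.count(find)
    if found != expected:
        raise EditMismatch(f"expected {expected} match(es) for {find!r}, found {found}")
    return text.replace(find, replace)


# The file is tab-indented; the generated find string assumed four spaces.
file_text = "if ok:\n\treturn 1\n"
try:
    replace_checked(file_text, "    return 1", "    return 2")
except EditMismatch as err:
    print("edit rejected:", err)

# When the count matches, the edit goes through.
print(replace_checked("a = 1\nb = 1\n", "= 1", "= 2", expected=2))
```

The mismatch error is the signal a tool can feed back to the model for a re-prompt. Without the count check, the first call would return the file unchanged and look like a success.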

4. AI tools rank your codebase like Google ranks pages

You cannot fit a 200-file codebase into an LLM’s context. Every AI coding tool faces the same problem: which subset of your repo do you load when the user asks “fix the bug in checkout”?

Aider’s solution is the cleanest one I’ve seen, and it is mathematically the same algorithm Google used for web search in the 1990s.

Aider parses your entire repo with tree-sitter (130+ languages supported), extracting every symbol definition and every reference. It builds a directed graph: files are nodes, symbol references between files are edges. It then runs personalized PageRank on this graph, with the personalization weighted toward symbols mentioned in the current chat. The output is a relevance score per file, and the top-N files within a token budget become the context the LLM sees.

Edge weights are tuned: identifiers the user explicitly mentioned get a 10x multiplier, well-named (long, specific) identifiers get 10x, and files already in the chat get 50x. The result is a context window that is dense with the symbols the user actually cares about, instead of the random first 200KB of the repo.
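Here is a toy version of that ranking, written as a plain power iteration so it runs without dependencies. It is a sketch of personalized PageRank on a file graph, not Aider's implementation. The file names, the edge direction (from the referencing file to the defining file), the damping factor of 0.85, and the single 10x edge are my assumptions for illustration.

```python
def personalized_pagerank(
    edges: dict[tuple[str, str], float],
    personalization: dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> dict[str, float]:
    """Rank nodes of a weighted directed graph; all edge weights must be positive."""
    nodes = sorted({n for edge in edges for n in edge} | set(personalization))
    total = sum(personalization.values())
    if total <= 0:
        raise ValueError("personalization weights must sum to a positive value")
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}

    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        # Each file passes its rank along its outgoing references, by weight.
        for (src, dst), weight in edges.items():
            nxt[dst] += damping * rank[src] * weight / out_weight[src]
        # Files with no outgoing references hand their rank back to the teleport set.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        for n in nodes:
            nxt[n] += damping * dangling * teleport[n]
        rank = nxt
    return rank


# Hypothetical repo: an edge points from the referencing file to the defining file.
edges = {
    ("checkout.py", "cart.py"): 10.0,   # reference to an identifier the user mentioned (10x)
    ("checkout.py", "pricing.py"): 1.0,
    ("cart.py", "pricing.py"): 1.0,
    ("admin.py", "reports.py"): 1.0,    # unrelated corner of the repo
}
ranks = personalized_pagerank(edges, personalization={"checkout.py": 1.0})

for path, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{path:12s} {score:.3f}")
```

Because all the teleport weight sits on the file in the chat, rank only flows to files reachable from it. The unrelated `admin.py` and `reports.py` pair scores zero regardless of how large those files are.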

When I worked through Aider’s repo map design for the first time, the parallel hit me hard. The 1990s gave us PageRank for the web. The 2020s gave us PageRank for codebases. Same math, different graph.

Cursor and Claude Code use different mechanisms (vector embeddings, semantic search, agentic file-reading), but they are all solving the same problem with the same constraint: budget the context to the most relevant subset, not the whole repo.
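Whatever produces the relevance scores, the last step has the same shape: fit the best-scoring files into a fixed token budget. A minimal greedy sketch follows. The scores, token counts, and the skip-and-continue policy are assumptions of mine, not a description of how any of these tools actually packs its context.

```python
def select_context(
    scores: dict[str, float],
    token_counts: dict[str, int],
    budget: int,
) -> list[str]:
    """Pick files in descending score order, skipping any that would exceed the budget."""
    chosen: list[str] = []
    used = 0
    # Sort by score (highest first), breaking ties by path for determinism.
    for path in sorted(scores, key=lambda p: (-scores[p], p)):
        cost = token_counts[path]
        if used + cost <= budget:
            chosen.append(path)
            used += cost
    return chosen


# Hypothetical relevance scores and per-file token counts.
scores = {"checkout.py": 0.50, "cart.py": 0.30, "pricing.py": 0.15, "legacy.py": 0.05}
token_counts = {"checkout.py": 1200, "cart.py": 2500, "pricing.py": 900, "legacy.py": 400}

print(select_context(scores, token_counts, budget=2200))
# → ['checkout.py', 'pricing.py']
```

Note that the second-ranked file is dropped here because it is too large for the remaining budget. That is one reason a relevant file can silently go missing from context, and why it is worth checking what your tool actually loaded.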

5. Closed-loop feedback is the only thing that ships

A 2025 benchmark across the major coding agents made one finding clear: the gap between agents that can run their own tests and agents that can’t is bigger than the gap between any two frontier models.

Render’s coding-agent benchmark and Artificial Analysis’s coding agents leaderboard both ranked tools partially on whether they close the loop. Closed-loop means the AI tool can see the result of its own edit: the test pass/fail, the type-checker output, the lint warnings, the runtime exception. Open-loop means it writes code, says “done,” and you discover the failure later.

The reliability difference is not subtle. An open-loop model has to be right on the first try, with all the tokenizer-noise and context-budget limits described above. A closed-loop model can be wrong twice, see the failure, fix it, and only stop when the loop closes green. For anything more complex than a one-line tweak, the closed loop is what makes the work usable.

This is also why I run Claude Code with npm run test and npm run build plumbed in as commands it can invoke directly. Without that, even Claude Opus is shipping open-loop. With it, the same model converges on correct code in two or three iterations.
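The loop itself is simple enough to sketch. The code below is an illustration of the pattern, not how Claude Code or any other agent is implemented. `run_check`, `closed_loop`, and the `revise` callback (standing in for "send the failure output back to the model and apply its next edit") are hypothetical names.

```python
import subprocess
from typing import Callable


def run_check(command: list[str]) -> tuple[bool, str]:
    """Run a test or build command and return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def closed_loop(
    check: Callable[[], tuple[bool, str]],
    revise: Callable[[str], None],
    max_rounds: int = 3,
) -> bool:
    """Check, feed failures back, and stop when the check passes or rounds run out."""
    for _ in range(max_rounds):
        passed, output = check()
        if passed:
            return True
        revise(output)  # the failure output is the signal an open-loop tool never sees
    return check()[0]


# Stand-in for a real test run: fails twice, then passes.
attempts = {"n": 0}


def fake_check() -> tuple[bool, str]:
    attempts["n"] += 1
    return attempts["n"] >= 3, f"attempt {attempts['n']}"


feedback: list[str] = []
ok = closed_loop(fake_check, feedback.append)
print(ok, feedback)

# With a real project the check would be something like:
#   closed_loop(lambda: run_check(["npm", "run", "test"]), revise=send_to_model)
# where send_to_model is whatever hands the output back to your agent.
```

The `max_rounds` cap matters in practice: a loop that never converges should stop and hand control back to you instead of burning tokens.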

6. Vibe coding does not mean not knowing how the code works

This one is contentious. The original Karpathy framing of “vibe coding” was about lowering the friction of the inner loop: describe what you want, accept the AI’s first attempt, iterate. Read Google Cloud’s writeup for the canonical version. Nothing in that framing said “don’t read the code.”

The conflation has been weaponized by two opposing camps. The “vibe coding is dangerous” camp uses it to mean “the operator doesn’t know what their code does,” then points at the resulting disasters. The “vibe coding is the future” camp uses it to mean “you don’t have to know what your code does,” then ships the same disasters and is surprised when they don’t scale.

Both camps are wrong because both treat vibe coding as a binary. It isn’t. Vibe coding is a technique, not a discipline. The discipline is reading what the AI wrote at least as fast as the AI generates it. The technique is having the AI generate it in the first place.

The successful vibe-coded production apps I’ve inherited are all maintained by people who can answer “why does this pattern exist” without consulting the prompt that produced it. The unsuccessful ones share a common pattern: the operator stopped reading the diffs, the AI drifted across a few iterations, and now nobody can explain why a particular abstraction is shaped the way it is. The right discipline is to use the AI to lower the cost of writing code while raising the bar on the rate at which you read what it wrote.

What this means for shipping production AI-built code

If you are using these tools for real client work, the practical implications are:

  1. Pick tools with closed-loop feedback enabled. Claude Code with test/build commands wired up. Cursor with terminal access. Aider with the test loop. Open-loop tools (most IDE-extension copilots, most chat-window-only setups) are fine for prototypes and dangerous for production.
  2. Trust full-file rewrites under 400 lines, distrust diff-only output. If your tool generates a unified diff that applied to 0 of 4 hunks, that is the silent failure mode from §3. Don’t accept “applied 0 edits” as a no-op; treat it as a bug to investigate.
  3. Understand which files your tool is loading and why. If you’re on Aider, read the repomap output. If you’re on Cursor, watch which files the agent opens. If you’re on Claude Code, name the files explicitly when the change is non-obvious. The PageRank-style ranking is good but not magic.
  4. Match the tool to the task. Inline tab-completion: Cursor. Whole-repo reasoning: Claude Code. Transparent file selection: Aider. Visible UI scaffolding: Lovable / Bolt. Use one tool for everything and you will fight one of its weaknesses on every other change.
  5. Read the diffs. The cost of vibe coding is paid in production unless you read the code at the rate the AI writes it. The best engineers I know who use these tools heavily all share this discipline.


Frequently asked questions

How does Cursor edit code without using diffs?

Cursor's "fast apply" pattern uses two models per edit. A large frontier model (Claude Opus, GPT-5, etc.) generates the new code in a structured form, then a smaller fine-tuned model rewrites the entire file with the change applied. The reason this beats classic diff-format edits is empirical: for files under around 400 lines, full-file rewrites have a higher success rate than diff-style patches, because the small model can use surrounding context to avoid the boundary errors that break diff application. Aider's own benchmarks confirmed the same finding (issue #625). Above 400 lines, diff-style edits start to win again because the rewrite token cost grows linearly while diff cost stays roughly flat.

What is Aider's repomap and how does PageRank fit in?

Aider's repomap is a tree-sitter-parsed graph of your entire repository. Files are nodes, symbol references between files are edges. Aider runs the PageRank algorithm (the same math Google used to rank web pages) on this graph to score every file by how central it is to the symbols mentioned in the current chat. The most central files get included in the LLM's context; everything else is omitted. Edge weights are boosted for identifiers the user mentioned (10x), well-named identifiers (10x), and files already in the chat (50x). Aider supports 130+ languages through tree-sitter parsers and the result is a token-budgeted summary of the most important code entities for whatever the user is asking.

Why do AI coding tools rewrite full files instead of generating diffs?

Three reasons rooted in how language models are trained and decoded. First, training data: code on GitHub is overwhelmingly stored as full files, not as patches, so the model has seen far more examples of complete files than of diff hunks. Second, tokenizer behaviour: diff format has rigid structure (@@ lines, +/-, exact whitespace) that tokenizes badly and is unforgiving when the model drifts by even one character. Third, decoding cost: under about 400 lines, the rewrite cost is small enough that the success-rate advantage outweighs the token overhead. Cursor's instant-apply pattern (and Morph's fast-apply API) productized this finding. Above 400 lines, diff format wins again because rewrite cost grows linearly while diff cost stays roughly flat.

Cursor vs Claude Code vs Aider, which one for which job?

Pick based on the kind of edit you are making, not based on which is "best." Cursor is an IDE fork with inline tab-complete and Composer mode; it's strongest for visible UI work, component scaffolding, and changes you want to see diffed in the editor before accepting. Claude Code is an agentic CLI; it is strongest for whole-repo reasoning, multi-file refactors, code review, documentation passes, and any task where you would rather describe the change in English than click through diffs. Aider is also a CLI tool, but with an explicit chat loop and the most transparent repo-understanding system (the PageRank repomap). It is strongest when you want to see exactly which files were considered and why. In my own workflow, Claude Code plans and reviews, Lovable or Cursor ships the visible UI, and I act as the human referee on every handoff.

What is closed-loop feedback in AI coding and why does it matter?

Closed-loop feedback means the AI tool can see the result of its own edit (test output, type-check errors, lint warnings, runtime exceptions) and iterate against that signal automatically. Without it, the model writes code, reports done, and you discover the bug at runtime. With it, the model writes code, runs the test, sees the failure, fixes the failure, and only stops when the loop closes green. Claude Code, Aider, and Cursor's agent mode all support some form of this. The reason it is the single biggest reliability lever in 2026 AI coding is that an open-loop model has to be right on the first try, while a closed-loop model can be wrong twice and still ship correct code. For production work, the closed loop is non-negotiable.

Does vibe coding mean you don't need to know how the code works?

No, and the conflation is what produces most of the spectacular failures attributed to "vibe coding." The original Karpathy framing of vibe coding was about lowering the friction of the inner loop, not about lowering the bar on understanding. The successful vibe-coded production apps I've inherited are all maintained by people who know what their code does, even if they didn't type every character of it. The unsuccessful ones share a common pattern: the operator stopped reading the diffs, the AI drifted, and now nobody can explain why a particular pattern exists. The right discipline is to use the AI to lower the cost of writing code while raising the bar on the rate at which you read what it wrote.