Article · Feb 20, 2026

How Aider's repomap uses PageRank and tree-sitter to rank your codebase

Aider parses your repo with tree-sitter, builds a symbol graph, and runs PageRank to pick which files the LLM should see. Here is how it works.

If you have used Aider for any non-trivial change in a multi-file codebase, you have probably noticed that the tool somehow knows which files to include in the prompt without you naming them. Type “fix the bug in the checkout flow” and Aider includes the cart component, the checkout API handler, and the relevant tests, while leaving the unrelated marketing pages out. The data structure that makes this work is called the repomap, and the algorithm that ranks files inside it is personalized PageRank running on a tree-sitter-parsed symbol graph. This post is a tour through how that works and why it is the cleanest solution to the context-budget problem any AI coding tool has shipped.

This is a deep dive on Truth #4 from “How AI coding tools actually edit code: 6 truths from 4,000+ hours of building software,” and it pairs with the post on diff vs full-file rewrites.

The context-budget problem every AI coding tool faces

LLMs have context windows. Claude Sonnet’s standard window is 200K tokens. GPT-5’s is comparable. A typical SaaS codebase of 50–100 files in TypeScript or Python is 30,000–80,000 lines, which expands to roughly 100,000–250,000 tokens when fully tokenized. So even small codebases either exceed the context window or come close enough that you cannot afford to load everything.
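To make the arithmetic concrete, here is a back-of-the-envelope sketch. The tokens-per-line ratio is simply the one implied by the figures above (100,000 tokens for 30,000 lines), not a measurement, and the function names are made up for illustration.

```python
# Ratio implied by the figures above: 100,000 tokens / 30,000 lines.
TOKENS_PER_LINE = 100_000 / 30_000  # about 3.3; an assumption, not a measurement
CONTEXT_WINDOW = 200_000            # the standard window quoted above


def estimate_tokens(lines_of_code: int, tokens_per_line: float = TOKENS_PER_LINE) -> int:
    """Rough token estimate for a codebase of the given size."""
    return round(lines_of_code * tokens_per_line)


def fraction_of_window(lines_of_code: int, window: int = CONTEXT_WINDOW) -> float:
    """Share of the context window the whole repo would consume."""
    return estimate_tokens(lines_of_code) / window


for loc in (30_000, 80_000):
    print(loc, estimate_tokens(loc), f"{fraction_of_window(loc):.0%}")
```

Even at the low end, the repo alone would eat half the window before the prompt, the chat history, and the model's reply get any room.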

Every AI coding tool faces the same question on every prompt: which subset of the repo do I show the model? The naive answer (“the files the user named”) is often not enough; the user might say “fix the bug in checkout” without mentioning which files implement checkout, and the model has no way to find them without the tool’s help.

Three approaches have emerged:

  • Vector-embedding semantic search (Cursor, Sourcegraph, GitHub Copilot Workspace). Files are chunked and embedded; the prompt is embedded; the closest chunks by cosine similarity are loaded.
  • Agentic file discovery (Claude Code in agentic mode, OpenAI’s coding agents). The agent runs grep, glob, and find commands to discover relevant files, reading the output and asking for more.
  • Symbol-graph ranking (Aider, RepoMapper, several MCP servers). The repo is parsed into an AST, symbol references become graph edges, and a graph-ranking algorithm scores files by structural prominence.
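For contrast with what follows, the first approach boils down to a nearest-neighbor lookup. This is a toy sketch with hand-made three-dimensional vectors; real tools use model-produced embeddings with hundreds of dimensions, and the file names and the `top_chunks` helper are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Names of the k chunks whose vectors are closest to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hand-made "embeddings" for illustration only.
chunks = {
    "checkout/cart.ts": [0.9, 0.1, 0.0],
    "checkout/api.ts": [0.8, 0.3, 0.1],
    "marketing/landing.ts": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
print(top_chunks(query, chunks))
```

Note that nothing here looks at how files reference each other; relevance comes entirely from vector geometry.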

Aider chose the third path, and the algorithm it landed on is PageRank.

How Aider parses your repo with tree-sitter

The repomap pipeline starts by walking the repo and parsing every file Aider supports. Tree-sitter is the parser layer because it has three properties that matter for this use case:

  • Multi-language coverage. Tree-sitter ships parsers for 130+ languages, which means Aider’s repomap pipeline does not need a language-specific implementation for Python, Rust, TypeScript, Java, Go, and the long tail. The same code path handles all of them.
  • Incremental parsing. After an edit, tree-sitter can re-parse a file by reusing the unchanged parts of the previous syntax tree instead of starting from the full source. For a tool that refreshes its map on every chat turn, cheap re-parsing helps keep performance acceptable.
  • Fault tolerance. Tree-sitter produces a partial AST for syntactically broken code. The repomap can index a file the user has half-edited without crashing, which is critical because Aider operates on working-tree state, not just committed code.

For each file, tree-sitter emits an AST. Aider walks the AST and extracts two kinds of nodes:

  • Definitions. Functions, classes, methods, types, exported constants. The places where a symbol enters the codebase.
  • References. Call sites, type usages, imports, attribute accesses. The places where one file mentions a symbol defined in another.

These two sets become the vocabulary for the next step.
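To show the shape of that extraction step, here is a minimal sketch that swaps tree-sitter for Python's standard-library `ast` module, so it only handles Python and, unlike tree-sitter, fails on syntactically broken code. The `extract_symbols` function is hypothetical and is not Aider's code.

```python
import ast


def extract_symbols(source: str) -> tuple[set[str], set[str]]:
    """Return (definitions, references) for one Python source string."""
    tree = ast.parse(source)
    definitions: set[str] = set()
    references: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions.add(node.name)           # a symbol enters the codebase
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            references.add(node.id)              # a name being read or called
        elif isinstance(node, ast.Attribute):
            references.add(node.attr)            # attribute access such as obj.attr
        elif isinstance(node, ast.ImportFrom):
            references.update(alias.name for alias in node.names)
    # Drop names this file defines itself; cross-file references are what matter.
    return definitions, references - definitions


source = '''
from billing import charge_card

def process_checkout(cart):
    total = compute_total(cart)
    return charge_card(cart.customer, total)
'''
defs, refs = extract_symbols(source)
print(sorted(defs), sorted(refs))
```

The reference set is noisy (local variables like `cart` and `total` show up too). That is fine for this purpose, because only names that some other file defines will turn into graph edges.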

The symbol graph: nodes, edges, and edge weights

Aider builds a directed graph where:

  • Nodes are files (one node per source file).
  • Edges are references between files: an edge from file A to file B exists if A references a symbol defined in B.
  • Edge weights reflect how strong the reference is. A reference to a long, specific identifier (e.g., processStripeWebhook) gets a higher weight than a reference to a generic one (e.g., helper).

The exact edge-weight formula in Aider’s repository-mapping documentation uses several multipliers:

  • Mentioned identifiers (10x). Symbols the user explicitly named in the chat get a 10x multiplier on edges referencing them.
  • Well-named identifiers (10x). Long, specific names (“authenticateOAuthCallback” rather than “doAuth”) get a 10x multiplier under the assumption that they encode more information about the relationship.
  • Files already in the chat (50x). Files the user has explicitly added to the chat get a 50x multiplier on outgoing edges, biasing PageRank toward their neighborhoods.

The result is a sparse, weighted, directed graph. For a 100-file repo, this graph has roughly 100 nodes and a few hundred to a few thousand edges, depending on how interconnected the codebase is.
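Here is a rough sketch of how such a graph could be assembled from per-file definition and reference sets. The three multipliers are the ones listed above; the length cutoff for a "well-named" identifier, the way per-reference weights are summed, and the `build_edges` function itself are assumptions made for this example, not Aider's actual formula.

```python
from collections import defaultdict

# Multipliers as described above.
MENTIONED_MULT = 10.0
WELL_NAMED_MULT = 10.0
CHAT_FILE_MULT = 50.0
LONG_NAME_MIN_LEN = 8  # hypothetical cutoff for "long, specific"


def build_edges(definitions, references, mentioned=frozenset(), chat_files=frozenset()):
    """definitions/references: dict mapping file path -> set of identifiers.

    Returns {(referencing_file, defining_file): weight}.
    """
    # Index: identifier -> files that define it.
    definers = defaultdict(set)
    for path, idents in definitions.items():
        for ident in idents:
            definers[ident].add(path)

    edges = defaultdict(float)
    for src, idents in references.items():
        for ident in idents:
            weight = 1.0
            if ident in mentioned:
                weight *= MENTIONED_MULT
            if len(ident) >= LONG_NAME_MIN_LEN:
                weight *= WELL_NAMED_MULT
            if src in chat_files:
                weight *= CHAT_FILE_MULT
            for dst in definers.get(ident, ()):
                if dst != src:  # skip self-references
                    edges[(src, dst)] += weight
    return dict(edges)


definitions = {"billing.py": {"processStripeWebhook", "helper"}, "cart.py": {"Cart"}}
references = {"checkout.py": {"processStripeWebhook", "Cart"}, "cart.py": {"helper"}}
edges = build_edges(definitions, references, mentioned={"Cart"}, chat_files={"checkout.py"})
print(edges)
```

With these inputs, both edges leaving the chat file end up hundreds of times heavier than the generic `helper` reference, which is the bias the multipliers are meant to create.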

Why PageRank is the right algorithm for codebase ranking

Once you have the graph, the question is “which files are most central?” This is the question PageRank was designed to answer.

PageRank was introduced by Larry Page and Sergey Brin in 1998 as the ranking algorithm behind Google. The intuition is recursive: a node is important if other important nodes point at it. The math is a damped random walk on the graph; the stationary distribution of that walk is the PageRank score.
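Written out for a weighted graph, one common textbook form of that stationary condition is the following. The notation is mine rather than Aider's: $d$ is the damping factor (0.85 is a typical choice), $N$ is the number of nodes, $w_{uv}$ is the weight of the edge from $u$ to $v$, and nodes without outgoing edges are ignored for simplicity.

```latex
PR(v) = \frac{1 - d}{N}
      + d \sum_{u \in \mathrm{In}(v)}
        \frac{w_{uv}}{\sum_{k \in \mathrm{Out}(u)} w_{uk}} \, PR(u)
```

The first term is the random restart, spread evenly over all nodes. The second term is the walk itself: each node passes its score along its outgoing edges in proportion to their weights.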

For codebase ranking, the parallel is exact:

Web (1998)                                               | Codebase (2024)
Pages                                                    | Files
Hyperlinks                                               | Symbol references
A page is important if many important pages link to it   | A file is important if many important files reference its symbols
Search query                                             | Chat prompt

PageRank has three properties that carry over cleanly:

  • Convergence. The algorithm converges in tens of iterations even on graphs with millions of nodes. For a typical codebase graph (hundreds to low thousands of nodes), the ranking step is fast enough to be negligible next to the LLM call.
  • Scale-invariance. Each iteration touches every edge once, so runtime grows roughly linearly with graph size. On a sparse graph, doubling the codebase roughly doubles the work.
  • Robustness. PageRank is well-behaved on noisy, partial, or cyclic graphs. Real codebases are all three.

Aider’s implementation uses NetworkX’s PageRank, which is the standard Python implementation. There is no machine learning here; just well-understood graph math.
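To show how little machinery is involved, here is a plain-Python power iteration on a small weighted graph. It is a stand-in written for this post, not NetworkX's code and not Aider's; the `pagerank` function, its parameters, and the example file names are all illustrative.

```python
def pagerank(edges, damping=0.85, max_iter=100, tol=1e-9):
    """Power-iteration PageRank on a weighted directed graph.

    edges: {(src, dst): weight}. Returns {node: score}; scores sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Score held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes its score along its edges, proportional to weight.
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


edges = {
    ("checkout.py", "billing.py"): 2.0,
    ("cart.py", "billing.py"): 1.0,
    ("checkout.py", "cart.py"): 1.0,
}
scores = pagerank(edges)
for path in sorted(scores, key=scores.get, reverse=True):
    print(f"{path}: {scores[path]:.3f}")
```

The file everything else depends on (`billing.py` here) floats to the top, even though it references nothing itself.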

Personalization: how chat context biases the ranking

Plain PageRank assigns equal restart probability to every node, which means it produces a ranking that is a property of the graph alone. That is not what Aider wants; the same repo should produce different rankings for different chat questions.

The standard extension is personalized PageRank. Instead of an even restart distribution, you supply a vector that biases the random walk toward specific nodes. Aider builds this vector per chat turn:

  • Files the user has added to the chat get high restart weight (effectively, the algorithm spends more time near them).
  • Identifiers mentioned in the user’s prompt boost the restart weight on files that define those identifiers.
  • Recent file edits get a small boost, on the theory that the user is likely working in adjacent code.
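A sketch of how such a restart vector could be assembled is below. The weights are placeholders and the `build_personalization` function is hypothetical; Aider's actual values and logic may differ, and the recent-edit boost is left out to keep the example short.

```python
def build_personalization(files, chat_files=(), mentioned_idents=(), definitions=None,
                          chat_weight=1.0, mention_weight=1.0):
    """Return {file: restart probability}, normalized to sum to 1.

    definitions: dict mapping file path -> set of identifiers defined there.
    chat_weight and mention_weight are placeholder values for this sketch.
    """
    definitions = definitions or {}
    raw = {path: 0.0 for path in files}
    for path in chat_files:
        if path in raw:
            raw[path] += chat_weight
    for path, idents in definitions.items():
        if path in raw and any(ident in idents for ident in mentioned_idents):
            raw[path] += mention_weight
    total = sum(raw.values())
    if total == 0.0:
        # Nothing to bias toward: fall back to the uniform restart of plain PageRank.
        return {path: 1.0 / len(raw) for path in raw}
    return {path: weight / total for path, weight in raw.items()}


files = ["auth/oauth.ts", "auth/session.ts", "marketing/landing.ts"]
definitions = {
    "auth/oauth.ts": {"authenticateOAuthCallback"},
    "auth/session.ts": {"refreshSession"},
}
vec = build_personalization(
    files,
    chat_files=["auth/oauth.ts"],
    mentioned_idents={"refreshSession"},
    definitions=definitions,
)
print(vec)
```

In the random walk, this vector replaces the even restart share: whenever the walker restarts, it lands on a file with the probability given here, so the marketing page is never a restart target in this example.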

The output is a per-file score where files structurally near the user’s question rank highest. Aider then walks down the ranked list and includes files in the prompt up to a configurable token budget (default around 1K tokens of repomap summary, but this is tunable).
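The budget step can be pictured as a simple greedy pass over the ranked list. This is a simplification for illustration, not Aider's actual selection procedure; the per-file token costs are made up, and skipping an oversized file rather than stopping at it is a design choice of this sketch.

```python
def select_within_budget(scores, token_costs, budget=1024):
    """Walk files from highest to lowest score, keeping those that still fit."""
    chosen, used = [], 0
    for path in sorted(scores, key=scores.get, reverse=True):
        cost = token_costs[path]
        if used + cost > budget:
            continue  # does not fit in what is left; try lower-ranked files
        chosen.append(path)
        used += cost
    return chosen, used


scores = {
    "checkout/api.py": 0.40,
    "checkout/cart.py": 0.30,
    "tests/test_checkout.py": 0.20,
    "marketing/landing.py": 0.10,
}
token_costs = {  # hypothetical summary sizes in tokens
    "checkout/api.py": 500,
    "checkout/cart.py": 400,
    "tests/test_checkout.py": 300,
    "marketing/landing.py": 200,
}
chosen, used = select_within_budget(scores, token_costs, budget=1024)
print(chosen, used)
```

Raising the budget lets more of the ranked list through, which is the practical effect of tuning the map size.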

What other tools do differently

Cursor’s context selection leans heavily on vector embeddings for semantic search and on the agent’s ability to open files it predicts will be relevant. The trade-off is that Cursor’s choices are less inspectable than Aider’s; you cannot dump the embedding-space neighborhood and reason about why a file did or did not make the cut.

Claude Code in agentic mode tends to read files explicitly via shell commands. The user can see the agent typing grep -r "authenticate" src/ and watch which files come back. This is the most transparent of the three approaches but also the slowest, because every file read is a turn-cost.

Aider’s PageRank repomap sits in the middle: more transparent than embeddings, more deterministic than agentic search. The same repo and same prompt produce the same ranking. You can dump the repomap to inspect what the model actually saw. For debugging “why did the model not see file X,” that property is worth a lot.

Practical implications for daily use

Three operational habits if you are using Aider:

  1. Use long, specific identifiers in your code. Aider’s edge-weight formula gives a 10x multiplier to long names, so processStripeWebhook is more discoverable to the repomap than a generic handle. This is good naming hygiene anyway, with the side effect of better repomap rankings.
  2. Mention identifiers in your prompt when the question is structurally ambiguous. “Fix the bug in the checkout flow” is a vector-embedding-friendly prompt. “Fix the bug in processCheckoutPayment” gives PageRank a 10x edge-weight boost on every reference and dramatically narrows the ranking.
  3. Add the most relevant file to the chat explicitly when you know it. The 50x multiplier on chat-included files dominates the ranking. If you know the auth bug is in auth/oauth.ts, add that file with /add and let PageRank propagate from there.
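The second habit is easy to see in a sketch. Assuming the tool matches identifier-like words in the prompt against symbols defined in the repo (the `mentioned_identifiers` function and the symbol names below are hypothetical), the vague prompt matches nothing while the specific one hands the ranker a concrete symbol:

```python
import re

# Words that look like identifiers: letters, digits, underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def mentioned_identifiers(prompt: str, known_symbols: set[str]) -> set[str]:
    """Identifier-like words in the prompt that match a symbol defined in the repo."""
    return set(IDENTIFIER.findall(prompt)) & known_symbols


known = {"processCheckoutPayment", "Cart", "renderLandingPage"}
vague = mentioned_identifiers("Fix the bug in the checkout flow", known)
specific = mentioned_identifiers("Fix the bug in processCheckoutPayment", known)
print(vague, specific)
```

The match is exact and case-sensitive in this sketch, so “checkout” does not hit `Cart` or `processCheckoutPayment`; only the full identifier does.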

The 90s gave us search; the 20s gave us this

When I worked through Aider’s repomap for the first time, the parallel to Google’s original algorithm hit hard. The 1998 paper that introduced PageRank for the web describes essentially the same algorithm Aider uses to rank your codebase in 2024. Same math, different graph, twenty-six years later.

It is also a reminder that AI coding tools are not magic. The infrastructure under them is well-understood algorithms, applied to a new domain. PageRank works on codebases for the same reason it worked on the web: both are sparse directed graphs where importance flows along edges.

If you are using AI coding tools on production codebases across North America, the UK and Ireland, the EU and EEA, or the ANZ region, and you want a second pair of eyes on what your tool’s context-selection layer is actually doing, let’s talk.

Frequently asked questions

What is Aider's repomap, and why does it exist?

Aider's repomap is a token-budgeted summary of the most important code entities in your repository, generated automatically before every chat turn. It exists because LLMs have context windows (200K tokens for Claude, similar for GPT) and even small codebases of 50-100 files can approach or exceed that budget when fully expanded. Aider needs to choose which files to include in the prompt, and the repomap is the data structure that drives that choice. It is regenerated automatically as files change, scoped to the active repo, and cached between turns to keep performance fast.

Why does Aider use PageRank specifically?

Because the codebase-ranking problem is structurally identical to the web-ranking problem Google solved with PageRank in 1998. Both are directed graphs (files reference each other via symbols, web pages reference each other via hyperlinks). Both need to score importance where the importance of a node depends on the importance of the nodes pointing at it. Both benefit from personalization (personalized PageRank adds a bias vector to the random walk; Aider biases toward files and symbols the user just mentioned). PageRank's mathematical properties (convergence, scale-invariance, robustness to graph structure) carry over cleanly. Aider uses NetworkX's PageRank implementation under the hood.

How does tree-sitter fit into Aider's repomap?

Tree-sitter is the parser that turns your source files into ASTs (abstract syntax trees) so Aider can extract symbol definitions and references. Aider supports 130+ languages through tree-sitter parsers, which means the same repomap pipeline works on Python, Rust, TypeScript, Java, Go, and dozens more without language-specific code. Tree-sitter is incremental (it only re-parses what changed) and fault-tolerant (it produces partial ASTs for syntactically broken code). For Aider's use case (run on every chat turn, must handle working-state code that may not yet compile), these properties are non-negotiable.

What is the personalization vector in Aider's PageRank?

Standard PageRank assigns equal restart probability to every node. Personalized PageRank biases the restart probability toward a chosen subset, so the algorithm converges to a ranking that favors graph neighborhoods near those nodes. Aider's personalization vector favors files already in the chat and files that define identifiers mentioned in the prompt. It works alongside the edge-weight multipliers for identifiers the user explicitly mentioned (10x), well-named identifiers such as long, specific function names (10x), and references from files already in the chat (50x). The result is that "show me where the auth flow lives" produces a ranking dominated by files that reference auth-related symbols, even if those files are buried deep in the repo.

How is Aider's repomap different from Cursor's or Claude Code's context selection?

Cursor and Claude Code use different mechanisms for the same problem. Cursor relies more on vector embeddings for semantic search and lets the agent open files it predicts will be relevant, with the IDE feeding relevant snippets back to the model. Claude Code in agentic mode tends to read files explicitly via grep and glob commands, producing a more conversational discovery process visible to the user. Aider's PageRank approach sits between them: more transparent than embedding search (you can dump the repomap and inspect it) and more deterministic than agentic discovery (the same repo and same chat produce the same ranking). The trade-off is that PageRank ranks files by topology, not by semantic similarity to the question; the other tools approach relevance through embeddings or agent-driven search instead.

How big a codebase can Aider's repomap handle?

Aider's repomap scales well into the hundreds of thousands of lines of code. Tree-sitter parsing is fast (sub-second on most repos), the symbol graph is sparse (each file typically references a handful of others, not hundreds), and PageRank on sparse graphs converges in tens of iterations. The practical limit is not the repomap itself but the LLM's context budget. A 1M-line repo will parse, but the top-N files that fit in a 200K-token context window may not include every file the model would ideally see. At that scale, you supplement the repomap with explicit file-mentioning in the prompt to bias the personalization vector.