How Aider's repomap uses PageRank to rank your codebase

Part of the Claude Code Engineering series. Post 4 of 9.

Aider knows which files to include in its prompt without you naming them. Type “fix the bug in the checkout flow” and the tool pulls the cart component, the checkout API handler, and the relevant tests, while leaving the marketing pages out. The data structure behind this is the repomap, and the algorithm that ranks files inside it is personalized PageRank running on a tree-sitter-parsed symbol graph. This is a tour through how that pipeline works.

This is a deep-dive on Truth #4 from How AI coding tools actually edit code: 6 truths from 4,000+ hours of building software, and pairs with the post on diff vs full-file rewrites.

Why every AI coding tool has a context-budget problem

LLMs have context windows. Claude Sonnet’s standard window is 200K tokens. A typical SaaS codebase of 50-100 files in TypeScript or Python is 30,000-80,000 lines, which expands to roughly 100,000-250,000 tokens when fully tokenized. So even small codebases either hit the limit or come close enough that you can’t afford to load everything.
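As a back-of-the-envelope check on those numbers, here is a small sketch. The 3.2 tokens-per-line ratio is an assumption: it is simply the ratio implied by the ranges above, not a measured constant, and real repos vary with language and style.

```python
CONTEXT_WINDOW = 200_000  # tokens, the standard window quoted above


def estimated_tokens(lines: int, tokens_per_line: float = 3.2) -> int:
    """Rough token estimate; tokens_per_line is an assumed ratio, not a measurement."""
    return round(lines * tokens_per_line)


def window_share(lines: int) -> float:
    """Fraction of the context window the whole repo would occupy."""
    return estimated_tokens(lines) / CONTEXT_WINDOW


for lines in (30_000, 80_000):
    print(f"{lines:>6} lines ~ {estimated_tokens(lines):>7} tokens "
          f"({window_share(lines):.0%} of the window)")
```

Under that assumption the low end of the range takes about half the window before the conversation, system prompt, or model output get any room, and the high end does not fit at all.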

The question on every prompt: which subset of the repo does the model actually see? The naive answer (“the files the user named”) breaks the moment someone says “fix the bug in checkout” without specifying which files implement it.

Three approaches have emerged across the tools:

Vector-embedding semantic search (Cursor, Sourcegraph, GitHub Copilot Workspace): files are chunked and embedded; the prompt is embedded; the closest chunks by cosine similarity are loaded.
Agentic file discovery (Claude Code in agentic mode, OpenAI’s coding agents): the agent runs grep, glob, and find to discover relevant files, reading the output and asking for more.
Symbol-graph ranking (Aider, RepoMapper, several MCP servers): the repo is parsed into an AST, symbol references become graph edges, and a graph-ranking algorithm scores files by structural prominence.

Aider took the third path.

Parse first: what tree-sitter actually does here

Abstract syntax tree (AST): a tree representation of source code where each node is a syntactic construct (function, class, import, call site). Unlike raw text, an AST lets a tool ask “what does this function call?” rather than “does the string processPayment appear in this file?”

The repomap pipeline starts by walking the repo and parsing every supported file through tree-sitter. Three properties make tree-sitter the right choice:

Multi-language coverage. Tree-sitter grammars are available for 130+ languages. The same repomap code path handles Python, Rust, TypeScript, Java, and Go without language-specific branches.

Incremental parsing. Tree-sitter can re-parse a file given only the changed bytes. On every chat turn, this is what keeps the pipeline fast rather than glacial.

Fault tolerance. Tree-sitter produces a partial AST for syntactically broken code. This matters because Aider operates on working-tree state. Files you are mid-edit, with missing braces or incomplete type annotations, still get indexed.

For each file, Aider walks the AST and extracts definitions (functions, classes, exported constants) and references (call sites, type usages, imports). These become the vocabulary for the graph.
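To make the definitions-and-references split concrete, here is a minimal sketch that uses Python's standard-library `ast` module as a stand-in for tree-sitter. That is a deliberate swap so the example runs without extra dependencies: unlike tree-sitter, `ast` handles only Python and raises `SyntaxError` on broken code. The `extract_tags` helper and the sample source are hypothetical, not Aider's code.

```python
import ast

# Hypothetical file contents, used only for illustration.
SOURCE = '''
import payments

def process_checkout_payment(cart):
    total = compute_total(cart)
    return payments.charge(total)

class CartSummary:
    pass
'''


def extract_tags(source: str) -> tuple[set[str], set[str]]:
    """Return (definitions, references) found in one Python source string."""
    definitions: set[str] = set()
    references: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions.add(node.name)       # a symbol this file defines
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            references.add(node.id)          # a name being read, e.g. compute_total
        elif isinstance(node, ast.Attribute):
            references.add(node.attr)        # attribute access, e.g. payments.charge
    return definitions, references


defs, refs = extract_tags(SOURCE)
print("defines:   ", sorted(defs))
print("references:", sorted(refs))
```

This crude walk also picks up local variable names such as `cart` and `total` as references. They are harmless in a graph, because a reference only becomes an edge when some other file defines a symbol with that name.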

How the symbol graph is built

Symbol graph: a directed graph where nodes are source files and edges represent symbol references. An edge from file A to file B means A references a symbol defined in B.

Once the AST extraction is done, Aider constructs the graph. Nodes are files. Edges are reference relationships between files. The edge weights are where the interesting logic lives:

References to identifiers the user explicitly named in chat get a 10x multiplier.
References to long, specific identifiers (e.g., processStripeWebhook rather than helper) get a 10x multiplier, on the theory that specificity encodes more information about the relationship.
Outgoing edges from files already added to the chat get a 50x multiplier, biasing PageRank toward their neighborhoods.

For a 100-file repo, this graph has roughly 100 nodes and a few hundred to a few thousand edges depending on how interconnected the codebase is. Sparse, weighted, directed.
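The weighting rules above can be sketched in a few lines of plain Python. Treat this as an illustration under stated assumptions rather than Aider's implementation: the file names and tags are made up, the 20-character cutoff for a "long" identifier is invented for the example, and the choices to multiply the boosts together and to sum repeated references are mine.

```python
from collections import defaultdict

# Hypothetical per-file tags, as a parsing pass might produce them.
DEFINES = {
    "cart.py": {"compute_total"},
    "payments.py": {"process_stripe_webhook", "charge"},
    "checkout.py": {"process_checkout_payment"},
}
REFERENCES = {
    "checkout.py": ["compute_total", "charge", "charge"],
    "webhooks.py": ["process_stripe_webhook"],
}


def edge_weights(defines, references, chat_files=(), mentioned=(), long_ident=20):
    """Build {(referencing_file, defining_file): weight} from symbol tags."""
    definers = defaultdict(set)
    for path, names in defines.items():
        for name in names:
            definers[name].add(path)

    weights = defaultdict(float)
    for src, names in references.items():
        for name in names:
            mul = 1.0
            if name in mentioned:           # identifier named in the chat
                mul *= 10
            if len(name) >= long_ident:     # assumed cutoff for "long, specific"
                mul *= 10
            if src in chat_files:           # outgoing edge from a chat file
                mul *= 50
            for dst in definers.get(name, ()):
                if dst != src:              # ignore references within one file
                    weights[(src, dst)] += mul
    return dict(weights)


w = edge_weights(DEFINES, REFERENCES,
                 chat_files={"checkout.py"}, mentioned={"compute_total"})
for edge, weight in sorted(w.items()):
    print(edge, weight)
```

With `checkout.py` in the chat and `compute_total` mentioned, the edge from `checkout.py` to `cart.py` ends up far heavier than anything else, which is the bias the multipliers are meant to create.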

Why PageRank solves this problem

PageRank: a graph-ranking algorithm introduced by Larry Page and Sergey Brin in their 1998 paper. The intuition: a node is important if other important nodes point at it. Mathematically, it is the stationary distribution of a damped random walk on the graph.
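For readers who want the formula, this is the usual textbook form for a weighted graph. The symbols are my notation, not the article's or Aider's: r_v is the score of file v, N the number of files, d the damping factor (0.85 is the commonly used value), and w_{uv} the weight of the edge from u to v. Nodes with no outgoing edges need extra handling that this form leaves out.

```latex
r_v \;=\; \frac{1-d}{N} \;+\; d \sum_{u \to v} \frac{w_{uv}}{\sum_{x} w_{ux}} \, r_u
```

The first term is the random restart and the second is the rank flowing in along edges, split in proportion to edge weight. Heavier edges carry more of the referencing file's score.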

The structural parallel between web ranking and codebase ranking is exact:

Web (1998)	Codebase
Pages	Files
Hyperlinks	Symbol references
A page is important if important pages link to it	A file is important if important files reference its symbols
Search query	Chat prompt

PageRank converges in tens of iterations even on large graphs. For a codebase graph in the hundreds-to-low-thousands of nodes, that takes milliseconds at most. Each iteration costs time linear in the number of edges, and the algorithm is well-behaved on noisy, partial, or cyclic graphs. Real codebases are all three.

Aider’s implementation uses NetworkX’s PageRank. No machine learning. Standard graph math from a well-maintained library.

Personalized PageRank: how your chat shapes the ranking

Plain PageRank produces a single ranking that is a property of the graph alone. The same repo always produces the same ranking, regardless of what you are asking. That is not what Aider wants.

Personalized PageRank: an extension of PageRank where instead of equal restart probability across all nodes, you supply a bias vector that weights specific nodes higher. The algorithm then converges to a ranking favoring graph neighborhoods near those nodes.

Aider builds this vector per chat turn. Files already added to the chat get high restart weight. Identifiers mentioned in the prompt boost restart weight on files that define or heavily reference those identifiers. Recent file edits get a smaller boost.
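Here is a hand-rolled power iteration showing what the restart bias does. It is a dependency-free stand-in, not Aider's code or NetworkX's implementation. The function name, the toy graph, and the choice to hand the rank of dead-end files back through the restart vector are assumptions made for the sketch.

```python
def personalized_pagerank(edges, personalization=None, damping=0.85,
                          tol=1e-6, max_iter=100):
    """Power iteration over {(src, dst): weight}; returns {node: score}."""
    nodes = sorted({n for edge in edges for n in edge} | set(personalization or {}))
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w

    # Restart distribution: uniform unless a bias vector is supplied.
    if personalization:
        total = sum(personalization.values())
        restart = {n: personalization.get(n, 0.0) / total for n in nodes}
    else:
        restart = {n: 1.0 / len(nodes) for n in nodes}

    rank = dict(restart)
    for _ in range(max_iter):
        # Rank sitting on files with no outgoing edges goes back via the restart vector.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        new_rank = {n: (1.0 - damping + damping * dangling) * restart[n] for n in nodes}
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        converged = sum(abs(new_rank[n] - rank[n]) for n in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy graph: two checkout-related edges, two marketing-related edges.
EDGES = {
    ("checkout.py", "cart.py"): 1.0,
    ("checkout.py", "payments.py"): 1.0,
    ("landing.py", "analytics.py"): 1.0,
    ("pricing.py", "analytics.py"): 1.0,
}

plain = personalized_pagerank(EDGES)
biased = personalized_pagerank(EDGES, personalization={"checkout.py": 1.0})
print("plain: ", max(plain, key=plain.get))
print("biased:", max(biased, key=biased.get))
```

With no bias, `analytics.py` wins because two files point at it. Once all restart weight sits on `checkout.py`, the marketing cluster receives no rank at all and the checkout neighborhood takes over.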

The result: “show me where the auth flow lives” produces a ranking dominated by files structurally near auth-related symbols, even if those files are buried ten directories deep. The algorithm then walks the ranked list and includes files up to a configurable token budget. Default is around 1K tokens of repomap summary, but it is tunable.
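The budget walk can be sketched as a simple greedy loop. This is a simplification for illustration: the scores and per-file token costs are invented, and stopping at the first entry that does not fit is my assumption about the cutoff, not a description of Aider's exact logic.

```python
def select_within_budget(ranked, token_costs, budget=1024):
    """Walk files from highest to lowest score until the next one would not fit."""
    chosen, used = [], 0
    for path, _score in sorted(ranked.items(), key=lambda kv: kv[1], reverse=True):
        cost = token_costs[path]
        if used + cost > budget:
            break  # assumed cutoff: stop at the first entry that overflows
        chosen.append(path)
        used += cost
    return chosen, used


# Hypothetical scores and summary sizes (in tokens) for four files.
ranked = {
    "auth/oauth.ts": 0.41,
    "auth/session.ts": 0.27,
    "api/login.ts": 0.20,
    "marketing/home.tsx": 0.12,
}
token_costs = {
    "auth/oauth.ts": 400,
    "auth/session.ts": 350,
    "api/login.ts": 300,
    "marketing/home.tsx": 500,
}

chosen, used = select_within_budget(ranked, token_costs)
print(chosen, used)
```

Raising the budget lets more of the ranked list through. The ranking itself does not change, only where the cut falls.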

What Cursor and Claude Code do instead

Cursor’s context selection leans on vector embeddings and lets the agent open files it predicts will be relevant. The trade-off is opacity: you cannot dump the embedding-space neighborhood and reason about why a file made or missed the cut.

Claude Code in agentic mode reads files explicitly via shell commands. You can watch it type grep -r "authenticate" src/ and see which files come back. Transparent, but every file read costs tokens and a turn.

Aider’s PageRank repomap sits between these. More transparent than embeddings: you can dump the repomap and inspect exactly what the model saw. More deterministic than agentic search: the same repo and same prompt produce the same ranking every time. For debugging “why did the model miss file X,” that property is valuable.

The genuine limitation: PageRank ranks files by topology, not semantic similarity. Cursor can surface a conceptually related file that shares no direct symbol references with the files you mentioned. Aider cannot, unless you name the right identifiers.

Three habits that improve repomap quality

The edge-weight formula means your naming choices affect what the model sees:

Use long, specific identifiers. The 10x multiplier applies to well-named identifiers. processStripeWebhook is more discoverable than handle. This is good naming hygiene anyway.
Name identifiers in your prompt when the question is structurally ambiguous. “Fix the bug in the checkout flow” gives PageRank nothing to personalize on. “Fix the bug in processCheckoutPayment” gives it a 10x edge-weight boost on every reference to that symbol.
Add the key file explicitly when you know it. The 50x multiplier on chat-included files dominates the ranking. If you know the auth bug is in auth/oauth.ts, add it with /add and let PageRank propagate from there.

The 90s gave us search; the 20s gave us this

When I worked through Aider’s repomap source for the first time, the parallel hit hard. The 1998 paper that introduced PageRank for the web describes essentially the same algorithm Aider uses to rank your codebase. Same math, different graph, twenty-six years later.

AI coding tools are not magic. The infrastructure under them is well-understood algorithms applied to a new domain.

If you are using AI coding tools on production codebases and want a second pair of eyes on what your tool’s context-selection layer is actually doing, let’s talk.

Related reading

How AI coding tools actually edit code: 6 truths from 4,000+ hours of building software: the master post this is a deep-dive of.
Why AI coding tools rewrite full files instead of using diffs: the companion deep-dive on Truth #2.

Frequently asked questions

What is Aider's repomap, and why does it exist?

Aider's repomap is a token-budgeted summary of the most important code entities in your repository, generated automatically before every chat turn. It exists because LLMs have finite context windows (200K tokens for Claude Sonnet's standard window) and even small codebases of 50-100 files can approach or exceed that budget when fully tokenized. Aider needs to choose which files to include in the prompt, and the repomap is the data structure that drives that choice. It is regenerated as files change, scoped to the active repo, and cached between turns to keep performance acceptable.

Why does Aider use PageRank specifically?

Because the codebase-ranking problem is structurally identical to the web-ranking problem Google solved in 1998. Both are directed graphs: files reference each other via symbols, web pages reference each other via hyperlinks. Both need to score importance where the importance of a node depends on the importance of the nodes pointing at it. Both benefit from personalization. PageRank's mathematical properties (fast convergence, per-iteration cost linear in graph size, and robustness to noisy or cyclic graphs) carry over cleanly. Aider uses NetworkX's PageRank implementation under the hood.

How does tree-sitter fit into Aider's repomap?

Tree-sitter is the parser that turns source files into ASTs so Aider can extract symbol definitions and references. Aider supports 130+ languages through tree-sitter parsers, which means the same repomap pipeline works on Python, Rust, TypeScript, Java, Go, and dozens more without language-specific code. Tree-sitter is incremental (re-parses only what changed) and fault-tolerant (produces partial ASTs for broken code). For Aider's use case, running on every chat turn against working-tree code that may not yet compile, these properties are non-negotiable.

What is the personalization vector in Aider's PageRank?

Standard PageRank assigns equal restart probability to every node. Personalized PageRank biases the restart probability toward a chosen subset, so the algorithm converges to a ranking that favors graph neighborhoods near those nodes. Aider's personalization vector gives extra restart weight to files already in the chat and to files that define or heavily reference identifiers mentioned in the prompt. It works alongside the edge-weight multipliers: 10x for references to identifiers the user explicitly mentioned, 10x for long, specific identifiers, and 50x for outgoing edges from files already in the chat. The result is that "show me where the auth flow lives" produces a ranking dominated by files that reference auth-related symbols, even if those files are buried deep in the repo.

How is Aider's repomap different from Cursor's or Claude Code's context selection?

Cursor relies more on vector embeddings for semantic search and lets the agent open files it predicts will be relevant, with the IDE feeding relevant snippets back to the model. Claude Code in agentic mode reads files explicitly via grep and glob commands, producing a discovery process visible to the user. Aider's PageRank repomap is the most deterministic of the three: the same repo and same chat produce the same ranking, and you can dump the repomap to inspect exactly what the model saw. The trade-off is that PageRank ranks by topology, not semantic similarity, so the other tools can surface conceptually related code that Aider's graph would miss.

How big a codebase can Aider's repomap handle?

Aider's repomap scales well into the hundreds of thousands of lines of code. Tree-sitter parsing is fast (sub-second on most repos), the symbol graph is sparse (each file typically references a handful of others), and PageRank on sparse graphs converges in tens of iterations. The practical limit is not the repomap itself but the LLM's context budget. A 1M-line repo will parse, but the top-N files that fit in a 200K-token window may not include every file the model would ideally see. At that scale, you supplement the repomap with explicit file-mentioning in the prompt to bias the personalization vector.
