Why Do LLMs Write Code More Easily Than They Modify It?

Olivier Vitrac, PhD, HDR — Adservio Innovation Lab, November 2025



Abstract

Large Language Models (LLMs) demonstrate a clear asymmetry between generation and modification tasks. They can generate code fluently from concise specifications, yet they struggle to revise or refactor large, structured codebases. This limitation is not merely practical — it is theoretical: editing involves higher information entropy and conditional complexity than writing from scratch.

In short, writing is a linear act of construction, whereas editing is a branched act of reconstruction. It requires maintaining the coherence of dependencies, names, and states — comparable to reweaving a Turing machine’s tape rather than writing it anew.

In simpler words

LLMs are brilliant architects but clumsy electricians: they design clean new systems from short briefs but struggle to rewire existing ones without tripping over their dependencies.

This asymmetry stems from information entropy and cognitive load, not raw computational power. It reflects a fundamental constraint rooted in computation theory and verified experimentally across recent benchmarks.


1. From a coding experience to complexity theory

💡 NOTE: Two formal measures of complexity are used in information theory: entropy and conditional Kolmogorov complexity. Both are defined and illustrated in Appendices A and B. Before discussing them, it is useful to see the problem.

Let us consider two very small programs: one extended sequentially, the other edited internally. Both end up producing the same visible effect, yet their token-level complexity for an LLM is drastically different.


1.1 Minimal illustration: extension vs. revision

Case A – Extension (linear writing)

Let the original file contain m = 1 line, to which we append n = 2 lines. Tokenization (GPT-2-style, approximate):

| Line | Code fragment | Tokens |
|------|---------------|--------|
| 1 | `print("Hello, world!")` | 5 |
| 2 | `print("Welcome to Adservio Lab.")` | 7 |
| 3 | `print("Enjoy your day.")` | 5 |
|   | Total | 17 |

The model performs pure linear generation: each new token follows the previous one with minimal uncertainty. Entropy is dominated by local lexical choices, and positional encoding is monotonic. Formally, H(P') ≈ n · H(X).


Case B – Revision (contextual editing)

Here, the final output is similar (greetings), but the operation is an edit of Program A: it introduces a loop, branching, and state variables.

Approximate tokenization:

| Code fragment | Tokens | Contextual links |
|---------------|--------|------------------|
| `names = ["Alice", "Bob", "Charlie"]` | 9 | introduces variable `names` |
| `for name in names:` | 6 | depends on `names` |
| `if name.startswith("A"):` | 8 | adds conditional branch |
| `print(f"Hello, {name}!")` | 9 | depends on branch variable |
| `else:` | 1 | contextual token |
| `print("Welcome to Adservio Lab.")` | 7 | reused literal |
| Total | ≈ 40 | multiple cross-dependencies |

Although the visible code only doubled, the effective token count more than doubles, and several tokens now carry contextual meaning (variable scopes, conditions, indentation, string reuse). Each of these relationships must be re-evaluated by the model, inflating the conditional entropy H(P'|P) and, equivalently, the conditional complexity K(P'|P).
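These counts can be reproduced approximately with an off-the-shelf tokenizer. The sketch below assumes the tiktoken library and its GPT-2 encoding; exact counts vary from one tokenizer to another.

import tiktoken  # assumed available; any BPE tokenizer gives similar orders of magnitude

enc = tiktoken.get_encoding("gpt2")

program_a = (
    'print("Hello, world!")\n'
    'print("Welcome to Adservio Lab.")\n'
    'print("Enjoy your day.")\n'
)

program_b = (
    'names = ["Alice", "Bob", "Charlie"]\n'
    'for name in names:\n'
    '    if name.startswith("A"):\n'
    '        print(f"Hello, {name}!")\n'
    '    else:\n'
    '        print("Welcome to Adservio Lab.")\n'
)

print(len(enc.encode(program_a)))  # extension: on the order of 20 tokens
print(len(enc.encode(program_b)))  # revision: roughly twice as many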


1.2 Immediate observation

Appending n lines to an m-line codebase mainly increases the lexical sequence length. Editing n lines inside an m-line codebase forces the model to re-interpret all tokens that might depend on the modified region. Hence, although fewer characters are produced, more information is processed.

(1)  K(P'|spec) ≈ n,  whereas  K(P'|P) ≈ n · log(dependencies),

where n is the number of lines written or edited and "dependencies" counts the tokens whose interpretation depends on the modified region.
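As a back-of-envelope illustration of Eq. (1), the sketch below compares the two cost models; the 6-tokens-per-line constant and the logarithmic coupling term are assumptions for illustration, not measured values.

import math

TOKENS_PER_LINE = 6  # assumed average; real values depend on code style

def cost_generate(n_lines: int) -> float:
    """K(P'|spec): cost grows linearly with the number of new lines."""
    return n_lines * TOKENS_PER_LINE

def cost_edit(n_lines: int, dependencies: int) -> float:
    """K(P'|P): each edited line also pays to re-encode its dependencies."""
    return n_lines * TOKENS_PER_LINE * math.log2(1 + dependencies)

print(cost_generate(2))   # extension of 2 lines -> 12 token-equivalents
print(cost_edit(2, 30))   # revision with 30 coupled tokens -> ~59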


1.3 From illustration to general principles

Figure 1 contrasts the cognitive and computational asymmetry between writing a few lines of code and revising those same lines within a complex environment. The discrepancy arises from the extra information — i.e., additional tokens — required to describe what must be changed, where, and how dependencies are preserved.

Linear generation (extension): Spec (short) → Write new lines (low entropy) → Output P' (≈ n tokens).

Contextual editing (revision): Existing code P (m lines) → Identify targets (≈ m tokens) → Apply edits (≈ n tokens) → Re-encode dependencies (≈ n · log m tokens) → Output P' (revised).

Figure 1 — Entropy in code generation vs. modification

LLMs consume more tokens to maintain coherence than to produce text. Editing forces them to recompute positional, syntactic, and semantic dependencies—an operation that scales faster than the visible diff.

The remainder of this note generalizes this observation. Sections 2–4 and the appendices formalize it using information entropy and conditional Kolmogorov complexity, providing a quantitative basis—and thermodynamic analogy—for the energetic cost of reasoning during code modification.


1.4 What complexity theory tells us

When an LLM writes from scratch, it generates P directly from its specification, and the informational cost is the generation complexity K(P|spec). When it edits existing code, the model must transform P into P', which involves the conditional complexity K(P'|P): how much new information must be injected while preserving all prior constraints.

In simple terms:

Generation = write everything anew → linear reasoning.
Editing = modify while preserving coherence → contextual reasoning.

  1. Information-theoretic asymmetry. The informational cost of editing is captured by the conditional Kolmogorov complexity K(P'|P), the minimal description required to transform an existing program P into its revised version P'. When edits propagate through shared dependencies or hidden registers, K(P'|P) increases sharply, whereas the generation complexity K(P'|spec) (from a blank specification) remains low.

    Editing is nonlinear: its cost scales with the entropy of the dependency graph, not merely with the size of the change.

    Figure 2 summarizes this behavior:

    • As dependency density increases, the minimal description length K(P'|P) grows linearly or faster.

    • The corresponding token cost (context + generation) grows in parallel.

    • Editing reliability decreases roughly inversely with conditional complexity as attention and memory saturate.

  2. Time–memory duality. In transformer architectures, reasoning cost increases with context length (O(n²) attention). Maintaining long-range dependencies across files, classes, or configurations demands both more tokens and more memory — echoing the classical time–space trade-off formalized by Hartmanis & Stearns (1965).

  3. Software-evolution entropy. Empirical studies show that source-code entropy spikes during major refactors or architectural shifts. These are precisely the conditions under which LLMs falter: high entropy yields unpredictable propagation of changes and reduced determinism in dependency resolution.

[Figure 2: plot — Growth of conditional Kolmogorov complexity with coupling; x-axis: repository coupling / dependency density (Low → Very High); y-axis: relative values 0.1–1.2.]

Figure 2 — Evolution of relative conditional Kolmogorov complexity K(P'|P) (orange), token cost (blue), and editing reliability (green) as repository coupling and dependency density increase.
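The dependency-propagation argument behind Figure 2 can be made concrete with a toy dependency graph. A minimal sketch, assuming the networkx library; the module names and edges are invented for illustration.

import networkx as nx

# Edges point from a definition to the code that depends on it.
G = nx.DiGraph([
    ("config", "db"), ("config", "api"),
    ("db", "models"), ("models", "api"),
    ("api", "tests"), ("models", "tests"),
])

def must_recheck(module: str) -> set:
    """Everything downstream of an edited module must be re-validated."""
    return nx.descendants(G, module)

print(must_recheck("config"))  # editing a hub: {'db', 'models', 'api', 'tests'}
print(must_recheck("api"))     # editing a near-leaf module: {'tests'}

Editing a hub module forces every downstream dependent to be re-validated, which is exactly the coupling-driven growth of K(P'|P) plotted above.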

2. What experiments show (empirical evidence made simple)

The theoretical arguments can be tested in practice. Over the last two years, several research teams have built benchmarks—large collections of real programming tasks—to measure how well LLMs perform when generating, debugging, or editing code. The table below summarizes five of the most widely cited ones.

| Benchmark | What it tests | Key finding |
|-----------|---------------|-------------|
| HumanEval | Small, isolated coding tasks such as "write a function that reverses a string." | LLMs perform very well (above 85% correct for GPT-3.5/4). These tasks require no external context. |
| SWE-bench | Real-world bug fixes drawn from open-source repositories. | Success rates drop below 25%, even for top models. Once multiple files and dependencies are involved, reasoning collapses. |
| RepoBench (ICLR 2024) | Understanding and editing entire repositories rather than single files. | Performance decreases sharply with project size and cross-file links. |
| CodePlan (TOSEM 2024) | Planning multi-step code edits (understanding, proposing, modifying, and verifying). | Models must "think in steps." Without planning or memory, they get lost mid-edit. |
| Lost in the Middle (TACL 2024) | How well models use very long contexts (tens of thousands of tokens). | Models tend to ignore information located in the middle of long inputs—critical for editing long codebases. |

In plain terms: Models are great when they can focus on a single self-contained problem (a function, a paragraph, an equation). They struggle as soon as they must reason about interconnections—the very fabric of software engineering.

These results empirically confirm what complexity theory predicts:

The more intertwined the context, the more information must be recalled, recomputed, and rewritten—raising both entropy and computational cost.


3. Conceptual analogy (why it feels harder)

The difference between writing and editing large codebases can be understood through analogies that bridge computer science, physics, and engineering.

| Perspective | Linear generation | Repository-level editing |
|-------------|-------------------|--------------------------|
| Turing machine | Writing tape sequentially—each symbol depends only on the previous one. | Rewriting linked cells while preserving state—one change ripples through the whole tape. |
| Transformer attention | Sparse and local: focus on a few relevant tokens. | Dense and global: attention must cover many tokens and dependencies at once. |
| Information flow | Low entropy—information flows in one direction. | High conditional entropy H(P'∣P)—information must be preserved and recombined. |
| Engineering metaphor | Drafting a clean new blueprint. | Rewiring an entire factory while it's still running. |

Reading this table

Each row is a way of describing the same asymmetry from a different angle.

From a thermodynamic perspective, editing is like maintaining order in a system full of moving parts: it requires energy just to avoid chaos.


4. Practical implications for AI-driven development

1. Strategic level — Favor modular regeneration

Instead of asking an LLM to rewrite existing code line by line, it is often more efficient to generate a clean replacement module from its specification and re-integrate it. This keeps the conditional complexity K(P'|P) low and avoids dependency explosions.
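A minimal sketch of this strategy, assuming a pytest test suite is available; generate_module stands in for any LLM code-generation call and is hypothetical.

import subprocess
from pathlib import Path

def generate_module(spec: str) -> str:
    """Hypothetical LLM call: writes a fresh module from its specification."""
    raise NotImplementedError

def regenerate(spec: str, target: Path) -> bool:
    """Replace a module wholesale instead of patching it line by line."""
    backup = target.read_text()
    target.write_text(generate_module(spec))        # regenerate from spec
    ok = subprocess.run(["pytest", "-q"]).returncode == 0
    if not ok:
        target.write_text(backup)                   # revert: the old module stays intact
    return ok

The swap is all-or-nothing: either the regenerated module passes the global test suite, or the original is restored untouched.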

2. Architectural level — Control the context

Enhance editing workflows with tooling that limits how much context the model must hold at once.

3. Operational level — Plan, then edit, then validate

Adopt a three-step loop:

  1. Plan — identify what must change and which files are involved.

  2. Edit locally — apply minimal, well-scoped changes.

  3. Validate globally — run tests or consistency checks.

This sequence prevents entropy from spreading through the system—exactly as a good thermodynamic process prevents heat loss.
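The loop can be written down directly. In the sketch below, plan_edits and apply_edit are hypothetical placeholders for the model-driven steps; validation assumes a pytest suite.

import subprocess

def plan_edits(goal: str) -> list:
    """Hypothetical planner: returns the files/regions that must change."""
    return []

def apply_edit(target: str) -> None:
    """Hypothetical local edit: a minimal, well-scoped change to one target."""

def edit_cycle(goal: str) -> None:
    for target in plan_edits(goal):                 # 1. Plan
        apply_edit(target)                          # 2. Edit locally
        passed = subprocess.run(["pytest", "-q"]).returncode == 0
        if not passed:                              # 3. Validate globally
            raise RuntimeError(f"Edit to {target} broke the build; revert it.")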

Closing insight

All these principles exploit the same asymmetry:

Generation lowers entropy by building from first principles; editing increases entropy by juggling dependencies.

Minimizing the number and scope of edits is therefore not only good software practice—it is good energy practice as well.


5. Conclusion and takeaways

Conclusion — Complexity-Conservation Principle

Every additional token has a cost — not only in computation, but in energy. When an LLM edits existing code, it must re-evaluate dependencies, positions, and states: this increases the conditional complexity K(P'|P) and consumes disproportionately more resources than simple linear generation. In information-theoretic terms, added entropy becomes extra work; in thermodynamic terms, repeated unnecessary edits accumulate as heat and emissions.

At scale, millions of avoidable “just one more edit” requests translate into significant power usage. Editing is therefore not only a matter of correctness or productivity — it is also a matter of computational and environmental responsibility.

“Ask only when entropy deserves it.”

Complexity–Conservation Rule (entropy-aware editing)

# Before requesting or applying an edit, estimate the change in conditional
# complexity per edited token: delta_K = K(P'|P)_after - K(P'|P)_before.
# (reject_or_rethink_edit and perform_edit are the project's own decision hooks.)
def entropy_aware_edit(delta_K_per_token: float) -> None:
    if delta_K_per_token > 0:
        reject_or_rethink_edit()   # edit increases global coupling / dependencies
    else:
        perform_edit()             # edit simplifies, modularizes, or localizes effects

Or in natural language, suitable for both humans and LLMs:
“Does this edit reduce dependencies, or just shift them?
Will it increase K(P'|P) or reduce it?”

In other words, we cannot enforce strict conservation — the second law of thermodynamics still holds — but we can enforce a design bias: only edits that lower conditional complexity should be favored and automated. Anything else is not only bad engineering; it is wasted entropy on the planetary budget.


References

  1. Hartmanis J., Stearns R. E. (1965). On the Computational Complexity of Algorithms. Trans. AMS, 117, 285–306. doi:10.2307/1994208

  2. Vitányi P. M. B. (2022). Information, Complexity, and Meaning. Springer.

  3. Jain S. et al. (2023). SWE-bench: Can LLMs Fix Real Bugs? NeurIPS 2023.

  4. Zhu Z. et al. (2024). RepoBench: A Repository-Level Benchmark for LLMs. ICLR 2024.

  5. Bairi R. et al. (2023). CodePlan: Repository-Level Code Editing with LLMs. ACM TOSEM.

  6. Liu N. et al. (2024). Lost in the Middle: LLMs Struggle with Long Contexts. TACL.

  7. Torres R. et al. (2022). Entropy of Source Code as a Predictor of Software Evolution. Empirical Software Engineering, 27, 45.

  8. Dao T. et al. (2023). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135.




Appendix A. Entropy and its meanings

A.1 Definition

In information theory, the entropy of a random variable X represents the minimal number of bits required—on average—to describe one outcome of X:

(2)  H(X) = −Σ_i p_i log₂ p_i   [bits]

If X denotes the next token to generate, p_i is the model's predicted probability of token i. The expected token cost is then:

(3)  Bits per token = H(X)

and, since most tokenizers encode roughly 1 token ≈ 4 characters, we can approximate the information load of a sequence of n tokens as:

(4)  Information load ≈ n · H(X) bits.
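Equations (2)–(4) can be checked numerically. A minimal sketch; the two next-token distributions are invented for illustration.

import math

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum_i p_i log2 p_i (Eq. 2)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

confident = [0.90, 0.05, 0.03, 0.02]  # model nearly sure of the next token
uncertain = [0.25, 0.25, 0.25, 0.25]  # four equally plausible continuations

print(entropy(confident))        # ~0.62 bits/token: cheap, linear generation
print(entropy(uncertain))        # 2.00 bits/token: costly, branched editing
print(100 * entropy(uncertain))  # Eq. (4): load of 100 such tokens, in bits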

When the model must reason about positions and dependencies (cross-file references, scopes, imports), the entropy grows because each token depends on a larger conditional context. This corresponds to conditional entropy:

(5)  H(P'|P) = H(P', P) − H(P),

which quantifies the extra bits—or extra tokens—needed to describe a modified program P' given the existing one P. As dependencies increase, H(P'|P) rises super-linearly: more tokens are consumed merely to re-express known structure and maintain positional consistency.


A.2 Concrete example with text and token counts

Consider the English word sequence:

“The cat sleeps.”

Using a standard tokenizer (e.g., GPT-2 BPE), it is 4 tokens: ["The", " cat", " sleeps", "."]

If we want to generate this sentence from scratch, the information load is roughly 4 tokens × 6 bits/token ≈ 24 bits (taking ≈ 6 bits of decision entropy per token, consistent with the totals below).

Now imagine an edit requiring insertion of an adjective (“black”) between The and cat, plus agreement on verb tense:

“The black cat was sleeping.”

The model must:

  1. Insert two tokens (" black", " was") in correct order → additional content entropy ≈ 12 bits.

  2. Re-encode all subsequent tokens with updated positions → each positional vector (≈ 8 bytes/token) must be recomputed.

  3. Propagate tense consistency (“sleeps” → “was sleeping”) → another 8 bits of conditional decision entropy.

Thus, even for this tiny edit, total information load nearly doubles (from 24 → ~44 bits) and positional recomputation affects all following tokens. In long documents—or codebases with hundreds of linked identifiers—the same proportional inflation applies: entropy compounds with the number of tokens that must stay coherent.
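The same accounting in code, assuming the ≈ 6 bits of decision entropy per token used above (a rough average, not a measured constant):

BITS_PER_TOKEN = 6                    # assumed average decision entropy per token

original = 4 * BITS_PER_TOKEN         # "The cat sleeps."  -> 24 bits
inserted = 2 * BITS_PER_TOKEN         # " black", " was"   -> +12 bits
tense = 8                             # sleeps -> sleeping: conditional decision
print(original, original + inserted + tense)  # 24 -> 44 bits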


A.3 Common-sense interpretation

Entropy measures how many questions must be answered to make something unambiguous.

In human terms, entropy is the cognitive and computational load of preserving order while changing detail.


| Situation | Typical tokens processed | Effective entropy | Computational implication |
|-----------|--------------------------|-------------------|---------------------------|
| Generate new file from prompt | 200–800 | Low (few dependencies) | Fast, cheap inference |
| Edit function with cross-refs | 2,000–8,000 | Moderate | Quadratic attention cost |
| Refactor multi-module repo | 20,000+ | High H(P'∣P) | Very expensive and slow |

Hence, entropy translates directly into token count × bits per token, which drives the energy, time, and memory required by the model. Large or branched edits therefore consume far more computational entropy than linear code generation.


Appendix B. Conditional Kolmogorov Complexity K(P'|P) and its practical meaning

B.1 Formal definition

Kolmogorov complexity K(X) is the length (in bits) of the shortest possible program that outputs X on a universal Turing machine. The conditional Kolmogorov complexity of P' given P is defined as:

(6)  K(P'|P) = min_π { |π| : U(π, P) = P' },

where U is a universal computer (or model), and |π| is the size of the minimal description π—essentially the edit program that transforms P into P'.

Intuitively:

If the modification touches deeply coupled regions, K(P'|P) can approach K(P') (it cannot exceed it by more than an additive constant); that is, editing may be as hard as rewriting.
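K(P'|P) is uncomputable in general, but compression gives a practical upper bound: the extra compressed bytes needed to describe P' once P is known, the idea behind normalized compression distance. A minimal sketch with the standard zlib module; absolute numbers depend on the compressor and the toy snippet is invented.

import zlib

def c(s: str) -> int:
    """Compressed size in bytes, a crude stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s.encode()))

def conditional_cost(p: str, p_new: str) -> int:
    """Approximates K(P'|P) as C(P + P') - C(P)."""
    return c(p + p_new) - c(p)

p = "def area(r):\n    return 3.14159 * r * r\n"
local = p.replace("3.14159", "3.14159265")   # isolated constant change
rename = p.replace("area", "disk_area")      # identifier change that must propagate

print(conditional_cost(p, local), conditional_cost(p, rename))

On a real repository, the rename variant grows with every file that references the identifier, while the constant change stays flat.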


B.2 Relationship with entropy and tokens

Shannon entropy and Kolmogorov complexity coincide in expectation for computable sources:

(7)  E[K(P)] ≈ H(P).

Thus, we can interpret K(P'|P) as the expected number of bits (or tokens) the model must process to make all dependent edits consistent. When a codebase is large, the minimal "edit program" grows because:

  1. Dependencies must be re-specified explicitly in text (imports, signatures, docstrings).

  2. Positional changes cascade—adding one symbol often forces dozens of updated references.

  3. The model must check consistency (syntactic and semantic), implying extra reasoning tokens.

Hence, K(P'|P) directly scales with the token budget required for context + generation + validation.


B.3 Example: small vs. coupled edit

Consider two tasks:

  1. Local edit: change a numerical constant

    • Minimal description length: “replace 0.95 → 0.97” (≈ 3 tokens).

    • K(P'|P) is constant, almost independent of file size.

  2. Coupled refactor: rename a class and propagate it

    • All calls (UserSession()), docstrings, imports, tests, and configuration keys must change.

    • For a 50 k-token repository, the LLM must locate and regenerate 500–2000 token spans with consistent semantics.

    • K(P'|P) ≈ O(affected tokens × log dependencies) → often thousands of tokens.

Thus, while K(P) (writing a simple class) may be ~200 tokens, K(P'|P) (editing across dependencies) can exceed 2,000 tokens—an order of magnitude more information.
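The contrast in B.3 can be measured directly on a toy file with the standard difflib module; the class and the counts are illustrative only.

import difflib

old = """class UserSession:
    TIMEOUT = 0.95

session = UserSession()
print(UserSession.TIMEOUT)
"""

local = old.replace("0.95", "0.97")                    # task 1: local constant edit
coupled = old.replace("UserSession", "ClientSession")  # task 2: coupled rename

def diff_lines(a: str, b: str) -> int:
    """Length of the unified diff ~ size of the minimal edit description."""
    return len(list(difflib.unified_diff(a.splitlines(), b.splitlines())))

print(diff_lines(old, local))    # small: one changed line
print(diff_lines(old, coupled))  # larger: every reference changes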


B.4 Operational analogy for LLMs

| Operation | Description | Approx. computational complexity |
|-----------|-------------|----------------------------------|
| Generate a new file from prompt | The model writes code directly from a specification; there are few prior constraints or dependencies. | K(P'∣spec): O(1) per token, i.e., linear growth with sequence length only. |
| Edit a single isolated function | The model must reason locally within a bounded scope and preserve syntax; minimal propagation of side effects. | K(P'∣P) ≈ O(local scope): modest increase with the number of dependent tokens. |
| Refactor interdependent modules | The edit touches multiple files, type hierarchies, or APIs; requires cross-module reasoning and positional re-encoding. | K(P'∣P) ≈ O(n log n): super-linear growth with repository size and coupling. |
| System-wide migration (e.g., API version bump) | The entire repository must remain consistent; imports, configs, and tests are rewritten coherently. | K(P'∣P) → K(P'): editing cost approaches that of rewriting from scratch. |

This progression formalizes why LLMs lose efficiency when editing complex systems: the edit description itself becomes as large as the new code.


B.5 Common-sense interpretation

Kolmogorov complexity measures the compressibility of a transformation. If a small, elegant "diff" suffices to move from P to P', the system is modular and predictable (low K(P'|P)). If every edit triggers ripple effects, the system is entangled (high K(P'|P)).

For LLMs, high K(P'|P) means larger prompts, more regenerated tokens, and lower editing reliability.


B.6 Linking back to Appendix A

Entropy H(P'|P) measures the uncertainty of the change, while K(P'|P) measures the shortest possible message describing it. Both are expressed in bits and, when mapped through a tokenizer, in expected token counts. The scaling relation is:

(8)  Expected tokens for edit ≈ K(P'|P) / (bits per token).

Therefore, high conditional complexity directly translates into larger prompts, longer inference times, and increased cost—precisely the symptoms observed when LLMs attempt large-scale code modifications.


Adservio Innovation Lab – November 2025