Skip to the content.

Recipe: executing the 30-day plan with an LLM coding agent

This page is process meta-content - it documents how this repository was built. If you are here to read about world-model evaluation, the technical pages are 00 - Thesis through 06 - Reading a scorecard. This page is for someone who wants to use a similar recipe on a different project.

This repository was built end-to-end with an LLM coding agent in the loop (Claude Code in this case; Codex, Cursor, Aider, or any equivalent agent that can read files, run shell, edit code, and spawn sub-agents works the same way). The bottleneck was never “the model cannot code”. It was specification clarity and review discipline.

The recipe below is reproducible. Each step has been used to ship the v0.3 → v0.7 cycle of this repo.

0. Setup

1. Hand it the week, not the task

Each week of the plan is a stand-alone brief. Drop the entire week’s section into the agent verbatim and ask:

Plan the implementation. List every file you will create or modify. Do not write code yet. When you list a metric or a test, state the invariant it locks in.

Approve the plan only when it lists the same files you would. If the agent proposes “improvements” outside the week’s scope, push back. Scope creep is the failure mode, not under-delivery.

2. Implementation in one round

Once the plan is approved:

Implement the plan. Write tests alongside each new module. After every file is written, run pytest -q and any example scripts the README mentions. Report failures rather than working around them.

A good agent will use a visible todo list, run tests before claiming success, and surface unexpected behaviour rather than papering over it.

3. Pre-tag adversarial review (the step that paid off 4-for-4)

Before tagging any release, spawn an independent review agent with a fresh context. The independence is what makes this work; the agent does not inherit your implementation rationalisations.

Prompt template:

Adversarially review the diff at HEAD against the brief above. The previous review caught [insert prior failure modes; for this repo it was per-call vs per-episode latency confusion, perturbation accounting overcounting the denominator, dead reporting paths, fragile test heuristics, missing invariants]. Look specifically for: math errors, doc/code mismatches, missing invariants, silent fallbacks, performance regressions. If you find nothing, say so explicitly. Output severity (critical / major / minor) | one-line description | file:line evidence | suggested fix.

Track record on this repo:

Zero releases shipped with metric-correctness bugs after this pattern was adopted. The review usually catches one finding that would have been embarrassing in public.

4. Tag, push, release

Only after review findings are addressed. Each tag corresponds to a working green-CI state. Release notes are drafted by the agent from the commit history plus the relevant doc-page updates, then read over by a human before publishing.

5. Soft passes (no version bump)

Documentation polishing, vocabulary tightening, rebranding, design-system changes, and Pages-only updates are doc-only passes. They do not need a version bump. They still deserve a single self-contained commit message that explains the why.

This repo has shipped several: the “soften framing” pass, the GitHub Pages landing, the IBM Plex design system, the interactive hero, the formula popovers. None bumped the version.

What to keep human-in-the-loop

The agent should not autonomously decide:

Anti-patterns this recipe explicitly avoids

Reading order if you are setting this recipe up on a new project

  1. The technical pages 00 - Thesis06 - Reading a scorecard.
  2. AGENTS.md (hard rules: scope, dependencies, non-affiliation, style).
  3. CONTRIBUTING.md (how a new metric / benchmark card / adapter / perturbation should be structured).
  4. One existing release commit message in git log (for example v0.6.0 or v0.7.0) to calibrate the level of detail the agent should produce.

How long does it actually take?

Per week, one focused sitting end-to-end (plan, implement, review, fix, commit, tag). Most of the wall-clock time is test runs and waiting on review cycles, not model think-time. The recipe favours fewer, deeper cycles over many shallow ones - exactly the discipline that catches metric-correctness bugs early.