
05 - 30-Day Prototype Plan

A focused four-week plan that takes the repository from “credible scaffolding” to “shareable applied narrative”.

The plan assumes one contributor, part-time. Every week ends with a runnable artifact, not just a doc.


Week 1 - Repository and toy benchmark

Goal: a public-facing scaffold a stranger can clone, install, and run on a laptop.

Exit criterion: pytest is green and python -m examples.two_room_toy.run_baseline prints a scorecard from a clean checkout.


Week 2 - Adapter interface and scorecards

Goal: prove the evaluation layer is model-agnostic.

Exit criterion: a developer can implement a new PlannerPolicy in under 50 lines and produce a scorecard without touching the rest of the codebase.
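To make the "under 50 lines" target concrete, here is a minimal sketch of what a new policy might look like. The PlannerPolicy protocol shown here (a single plan method taking an observation and a goal) is an assumption for illustration, as are RandomWalkPolicy, run_episode, and the one-dimensional corridor; the real interface in src/wmel may differ.

```python
import random
from typing import Protocol, Sequence


class PlannerPolicy(Protocol):
    """Hypothetical interface; the actual wmel protocol may differ."""

    def plan(self, observation: int, goal: int) -> Sequence[int]:
        """Return the actions to execute starting from `observation`."""
        ...


class RandomWalkPolicy:
    """Baseline that ignores the goal and proposes random steps."""

    def __init__(self, actions: Sequence[int], horizon: int, seed: int = 0) -> None:
        if horizon < 1:
            raise ValueError("horizon must be at least 1")
        self.actions = list(actions)
        self.horizon = horizon
        self.rng = random.Random(seed)  # seeded for reproducible scorecards

    def plan(self, observation: int, goal: int) -> list[int]:
        return [self.rng.choice(self.actions) for _ in range(self.horizon)]


def run_episode(policy: PlannerPolicy, start: int, goal: int, max_steps: int = 20) -> dict:
    """Toy 1-D corridor: each action is added to the position."""
    position, steps = start, 0
    while position != goal and steps < max_steps:
        # Replan whenever the previous plan has been fully executed.
        for action in policy.plan(position, goal):
            position += action
            steps += 1
            if position == goal or steps >= max_steps:
                break
    return {"success": position == goal, "steps": steps}
```

The point of the sketch is the separation: the episode loop only calls plan, so a new policy can be swapped in without touching the evaluation code.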

Status: shipped in v0.2 - see examples/maze_toy/ and src/wmel/adapters/tabular_world_model.py. BenchmarkEnvironment now exposes action_space. TabularWorldModelPlanner is a concrete LeWMAdapterStub subclass that fills in encode, rollout, score, and plan end-to-end with no third-party dependency.
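The four-method split (encode, rollout, score, plan) can be illustrated with a small stand-alone sketch. This is not the code in src/wmel/adapters/tabular_world_model.py: the class name TabularPlannerSketch, the constructor arguments, the integer states, and the exhaustive search are all assumptions chosen to keep the example short.

```python
from itertools import product


class TabularPlannerSketch:
    """Hypothetical stand-in, not the actual TabularWorldModelPlanner."""

    def __init__(self, transitions, action_space, goal, horizon=3):
        self.transitions = transitions  # {(state, action): next_state}
        self.action_space = list(action_space)
        self.goal = goal
        self.horizon = horizon

    def encode(self, observation):
        # Tabular case: the observation already is the model state.
        return observation

    def rollout(self, state, actions):
        # Predict successive states by table lookup; an unknown
        # (state, action) pair leaves the state unchanged.
        states = []
        for action in actions:
            state = self.transitions.get((state, action), state)
            states.append(state)
        return states

    def score(self, states):
        # Lower is better: distance of the final predicted state to the goal.
        return abs(states[-1] - self.goal)

    def plan(self, observation):
        # Exhaustive search over every action sequence of length `horizon`.
        start = self.encode(observation)
        candidates = product(self.action_space, repeat=self.horizon)
        best = min(candidates, key=lambda seq: self.score(self.rollout(start, seq)))
        return list(best)
```

Exhaustive search costs len(action_space) ** horizon rollouts, which is acceptable only for toy environments; the sketch uses it because it needs no third-party dependency.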


Week 3 - Baseline comparison and reporting

Goal: make the comparison story sharp.

Exit criterion: a single command produces a Markdown report comparing two policies on two environments with confidence intervals.
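One way to attach confidence intervals to a success-rate column is the Wilson score interval, sketched below with the standard library only. The function names, the input shape (a mapping from (policy, environment) to (successes, trials)), and the table layout are assumptions for illustration; the repository's report code may use a different interval or format.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def markdown_report(results: dict[tuple[str, str], tuple[int, int]]) -> str:
    """Render one table row per (policy, environment) pair."""
    lines = [
        "| Policy | Environment | Success rate | 95% CI |",
        "|---|---|---|---|",
    ]
    for (policy, env), (successes, trials) in sorted(results.items()):
        low, high = wilson_interval(successes, trials)
        rate = successes / trials
        lines.append(f"| {policy} | {env} | {rate:.2f} | [{low:.2f}, {high:.2f}] |")
    return "\n".join(lines)
```

The Wilson interval is used here because it stays inside [0, 1] and behaves sensibly at small episode counts, where the simpler normal approximation can produce bounds below 0 or above 1.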

Status (v0.4):

What remains for v0.5: a pluggable perturbation library (displacement, blocked-cell, delayed-action) and a small CLI front-end.

Status (v0.5): perturbation library shipped. wmel.perturbations defines a Perturbation ABC with two override hooks (apply_to_env, transform_actions) and three concrete subclasses: EnvPerturbation (delegates to env.perturb() and is the runner’s default), DropNextActions(k) (drops actions at the action level), and CompositePerturbation(*parts) (composes several perturbations). The runner’s inner loop was refactored to a deque-based action queue so that removing an action from the front of the queue is O(1). Scorecard.perturbation_name records which strategy was used. The CLI front-end remains for a later release.
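The class layout described above can be sketched as follows. Only the class and hook names come from this page; the hook signatures, the name attribute, the no-op defaults, and ToyEnv are assumptions made for illustration, so treat this as a reading aid rather than the contents of wmel.perturbations.

```python
from abc import ABC
from collections import deque


class Perturbation(ABC):
    """Base class with two no-op hooks; subclasses override what they need."""

    name = "none"

    def apply_to_env(self, env) -> None:
        pass

    def transform_actions(self, queue: deque) -> None:
        pass


class EnvPerturbation(Perturbation):
    name = "env"

    def apply_to_env(self, env) -> None:
        env.perturb()  # delegate to the environment's own perturbation


class DropNextActions(Perturbation):
    def __init__(self, k: int) -> None:
        if k < 0:
            raise ValueError("k must be non-negative")
        self.k = k
        self.name = f"drop_next_{k}"

    def transform_actions(self, queue: deque) -> None:
        # popleft on a deque is O(1), unlike list.pop(0).
        for _ in range(min(self.k, len(queue))):
            queue.popleft()


class CompositePerturbation(Perturbation):
    def __init__(self, *parts: Perturbation) -> None:
        self.parts = parts
        self.name = "+".join(part.name for part in parts)

    def apply_to_env(self, env) -> None:
        for part in self.parts:
            part.apply_to_env(env)

    def transform_actions(self, queue: deque) -> None:
        for part in self.parts:
            part.transform_actions(queue)


class ToyEnv:
    """Hypothetical environment that only counts perturb() calls."""

    def __init__(self) -> None:
        self.perturbed = 0

    def perturb(self) -> None:
        self.perturbed += 1
```

Keeping both hooks as no-ops on the base class means a composite can call every part uniformly without checking which kind of perturbation it holds.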


Week 4 - Public demo and applied narrative

Goal: make the artifact persuasive to a non-researcher.

Exit criterion: a non-researcher can read the README, run the demo, and articulate the thesis without help.

Status: docs/06_demo.md is shipped (a row-by-row product walkthrough of the maze horizon sweep). CONTRIBUTING.md is shipped. Releases are tagged at v0.3.1 and v0.4.0 with explicit non-affiliation disclaimers. CI runs the test suite plus a smoke test of the three example scripts on Python 3.11, 3.12, and 3.13. A screen capture remains optional and has not been recorded.


What is explicitly out of scope for the first 30 days

These are deliberate omissions. The whole point of this study is that the evaluation layer is what is missing, not yet another model.


Recipe for executing this plan with an LLM coding agent

The full recipe (setup, per-week loop, pre-tag adversarial review pattern, anti-patterns) now lives on its own page so that it does not crowd the technical study plan: see process/llm_agent_recipe.html. The track record on this repo: zero releases have shipped with metric-correctness bugs since the review pattern was adopted; four review passes caught 1 critical, 8 major, and ~15 minor issues before they reached a tag.