Now I have everything I need. Let me write the spec.
autoresearch — Product Specification
Reverse-engineered from karpathy/autoresearch · April 2026
Problem Statement
Training a good language model involves thousands of small decisions — architecture tweaks, hyperparameter sweeps, optimizer changes, batch size experiments. A human researcher makes one change, waits for a training run to finish, reads the results, thinks about what to try next, and repeats. It's slow, because humans need to eat, sleep, and attend meetings. Most of the wall-clock time in a research loop is the researcher not being at the keyboard.
autoresearch asks: what if the AI agent is the researcher? Give it a real training setup on a real GPU, let it modify the code, run the experiment, check if results improved, and decide what to try next — all by itself, all night long. You go to sleep; you wake up to a log of 100 experiments and (hopefully) a measurably better model.
This isn't a training framework. It's a research automation protocol. The AI agent is the researcher, the human is the research director, and the program.md file is how the director communicates strategy.
Actors & Goals
The Human (Research Director)
- Sets the research direction by editing
program.md — a markdown file that tells the agent what to focus on, what constraints to obey, and how to evaluate success
- Kicks off the agent with a simple prompt, then walks away
- Returns later to review a timestamped log of experiments (kept vs. discarded vs. crashed) and a git branch showing the surviving improvements
- May adjust
program.md between sessions to steer the agent toward different research questions
The AI Agent (Autonomous Researcher)
- Reads the codebase and
program.md for context
- Proposes and implements code changes to the training script
- Runs each experiment, reads the results, and decides whether to keep or discard
- Manages git state: commits before each experiment, advances the branch on improvement, reverts on failure
- Logs every experiment to a results file
- Runs indefinitely without asking for permission to continue
The Training Script (train.py)
- Acts as the experimental substrate — the thing being modified
- Contains a complete GPT model definition, optimizer setup, and training loop
- Produces a single, standardized metric after each run
The Evaluation Harness (prepare.py)
- Provides the fixed ground truth: data loading, tokenization, and the evaluation metric
- Is explicitly read-only — neither the human nor the agent may modify it
- Ensures that all experiments are measured on the same yardstick
Operator Value
- Overnight research throughput: ~12 experiments per hour, ~100 per sleep cycle, all without human supervision
- Apples-to-apples comparison: Every experiment runs under the same time budget and is measured with the same metric, making results directly comparable regardless of what the agent changed
- Automatic curation: The agent keeps only improvements and reverts everything else, so the git branch always represents the best-known configuration
- Experiment journal: A TSV log provides a complete audit trail — what was tried, what the result was, whether it was kept — readable by humans and scripts alike
- Platform-optimal tuning: Because the time budget is wall-clock, the system naturally discovers the best model for your specific hardware in a fixed time window
- Research direction as code: The human's research strategy lives in a markdown file that can be versioned, shared, iterated, and compared — turning research management into a programmable artifact
Core Capabilities
1. Autonomous Experiment Loop
The system runs a continuous cycle: propose a change → implement it → run the experiment → evaluate the result → keep or discard → repeat. This loop runs indefinitely until manually stopped. The agent never pauses to ask the human for guidance.
2. Fixed-Budget Fair Comparison
Every experiment trains for exactly 5 minutes of wall-clock time (excluding startup and compilation overhead). This makes experiments comparable regardless of what the agent changed — whether it doubled the model size, halved the batch size, or swapped the optimizer. The constraint is time, not steps or tokens.
3. Single-Metric Evaluation
All experiments are judged by one number: validation bits per byte (val_bpb). Lower is better. This metric is vocabulary-size-independent, so architectural changes that modify the tokenizer interaction or vocab dimensions are still fairly compared. The evaluation harness is fixed and cannot be modified.
4. Git-Based Experiment Management
Each research session runs on a dedicated git branch (e.g., autoresearch/mar5). The agent commits before each experiment, advances the branch when results improve, and reverts to the last-known-good commit when they don't. The branch tip always represents the best result so far.
5. Structured Result Logging
Every experiment — successful, failed, or crashed — is logged to a TSV file with the commit hash, val_bpb score, peak memory usage, keep/discard/crash status, and a human-readable description of what was tried. This file is intentionally kept out of git (untracked) to avoid merge noise.
6. Research Direction via Markdown
The human programs the agent by editing program.md — a plain markdown file that provides context, constraints, and strategy. This is the interface between human intent and agent behavior. The human never touches Python directly; they write prose.
7. Crash Recovery and Resilience
When an experiment crashes (out-of-memory, bugs, numerical instability), the agent diagnoses the failure, logs it, reverts, and moves on. Simple mistakes (typos, missing imports) are fixed and retried. Fundamentally broken ideas are abandoned gracefully. The loop continues.
8. Simplicity-Weighted Decision Making
The agent doesn't just chase lower numbers. A marginal improvement that adds ugly complexity is not worth keeping. A marginal improvement from deleting code is a clear win. Equal performance with simpler code is a keep. The system values elegance alongside performance.
Observable Behaviors
Starting a Research Session
- Trigger: Human opens an AI coding agent in the repo directory and prompts it to read
program.md and begin
- Response: Agent proposes a run tag (e.g.,
mar5), creates a branch autoresearch/<tag>, reads all in-scope files, verifies data exists, initializes results.tsv, and confirms readiness
- Persistent effect: A new git branch is created, a results file is initialized
- Failure mode: If data/tokenizer files are missing, agent tells the human to run
uv run prepare.py
Running an Experiment
- Trigger: Agent modifies
train.py, commits, and executes uv run train.py > run.log 2>&1
- Response: Training runs for ~5 minutes, then prints a summary block with val_bpb, training time, peak VRAM, MFU, total tokens, step count, parameter count, and model depth
- Persistent effect: A
run.log file is written; a git commit exists with the experimental changes
- Failure mode: If training loss explodes or goes NaN, the script exits immediately with "FAIL". If the run exceeds 10 minutes, the agent kills it and treats it as a failure.
Evaluating and Deciding
- Trigger: Experiment completes (or crashes)
- Response: Agent extracts val_bpb from the log, compares against the previous best
- Persistent effect on improvement: Branch advances; result logged as "keep" in TSV
- Persistent effect on regression:
git reset to previous commit; result logged as "discard"
- Persistent effect on crash: Revert; result logged as "crash" with val_bpb of 0.000000
Data Preparation (One-Time)
- Trigger: Human runs
uv run prepare.py
- Response: Downloads training data shards from HuggingFace, trains a BPE tokenizer, builds a byte-length lookup table for BPB evaluation
- Persistent effect: Data cached in
~/.cache/autoresearch/ — shards, tokenizer pickle, token byte counts
- Failure mode: Retries downloads up to 5 times with exponential backoff; exits if tokenizer training has insufficient data
Manual Baseline Run
- Trigger: Human runs
uv run train.py directly
- Response: Full training run with progress logging (step number, loss, learning rate, tokens/sec, MFU, remaining time), followed by the standard summary block
- Persistent effect: None beyond the log output
- Failure mode: Same NaN/explosion detection as autonomous mode
Edge Cases
- Agent runs out of ideas: The agent is explicitly instructed to never stop. If it feels stuck, it should re-read the code, look at references, try combining near-misses, or attempt more radical changes. The loop runs until the human interrupts.
- Marginal improvement with high complexity: The simplicity criterion tells the agent to weigh complexity cost against improvement magnitude. A 0.001 bpb improvement from 20 lines of hacky code? Probably not worth it. The same improvement from deleting code? Definitely keep.
- Multiple near-identical results: When val_bpb is equal to the baseline (not strictly better), the change is discarded. The system has a "strictly better" threshold.
- VRAM pressure: VRAM is a soft constraint. Moderate increases are acceptable for meaningful bpb gains, but dramatic blowups (OOM) are treated as crashes.
- Compilation warmup: The first 10 training steps are excluded from the time budget to account for
torch.compile overhead. This prevents JIT compilation time from eating into the actual training budget.
- Garbage collection stalls: Python's garbage collector is frozen after the first step and only runs every 5,000 steps to avoid ~500ms stalls during training.
Non-Functional Constraints
- Hardware: Requires a single NVIDIA GPU. Tested on H100. No multi-GPU, no distributed training, no CPU or MPS support in the main repo.
- Runtime: Python 3.10+, managed via
uv (Astral's package manager). Dependencies pinned in pyproject.toml.
- Time budget: Hardcoded at 300 seconds (5 minutes) of wall-clock training time. Not configurable from
train.py — it's a fixed constant in prepare.py.
- Context length: Fixed at 2048 tokens.
- Vocabulary: 8,192 BPE tokens + 4 special tokens. Trained from the dataset using
rustbpe.
- Dataset: climbmix-400b-shuffle from HuggingFace, with a pinned validation shard (shard 6542).
- Evaluation volume: ~20 million tokens evaluated for validation BPB (40 × 524,288 tokens).
- Precision: Training runs in bfloat16 with high-precision float32 matmul settings.
Non-Goals
- Not a training framework: This is not a general-purpose LLM training toolkit. It's a research acceleration protocol with a very specific loop.
- Not multi-GPU: Deliberately single-GPU, single-file, single-metric. Distributed training is out of scope.
- Not cross-platform: NVIDIA GPUs only in the main repo. Mac/AMD/Windows support is deferred to community forks.
- Not a hyperparameter search tool: The agent isn't doing grid search or Bayesian optimization. It's doing open-ended research — architectural changes, optimizer swaps, creative ideas. The search space is the entire training script.
- Not reproducible across hardware: Because the time budget is wall-clock, results are platform-specific. An H100 run and an RTX 4090 run will produce different "best" configurations. This is by design — the system optimizes for your hardware.
- Not a benchmarking suite: The val_bpb metric exists to compare experiments within a session, not to produce publishable benchmark numbers.
- No human-in-the-loop during execution: Once started, the agent runs autonomously. There is no approval step, no "should I continue?" prompt, no checkpoint review. The human's only control is to stop the process.
Suspected Implementation Leakage
The following observations are about how autoresearch is built rather than what it promises. They're noted here for completeness but belong in a technical spec:
- The model architecture is a GPT variant with RoPE embeddings, RMS normalization, value residual connections (ResFormer-style), a SwiGLU-like MLP with squared ReLU, and logit soft-capping at 15
- The optimizer is a custom MuonAdamW hybrid — Muon (with polar express orthogonalization) for 2D matrix parameters, AdamW for everything else (embeddings, scalars, unembedding head)
- Flash Attention 3 is used via the
kernels package, with a Hopper-specific path vs. a community fallback
- Model dimensions are computed from a DEPTH × ASPECT_RATIO formula, snapped to HEAD_DIM alignment
- The dataloader uses best-fit packing with BOS-aligned rows and zero padding waste
- Learning rate scheduling uses a warmup/constant/warmdown pattern based on wall-clock progress (not step count)
- Weight decay decays linearly to zero over training, and Muon uses "cautious" weight decay (only decays parameters whose gradient agrees with the current value)
Reconstructed by MacGyver from source code, documentation, and commit history. No claims about future behavior — this is what the system does today.