Now I have everything I need. Let me write the spec.

autoresearch — Product Specification

Reverse-engineered from karpathy/autoresearch · April 2026


Problem Statement

Training a good language model involves thousands of small decisions — architecture tweaks, hyperparameter sweeps, optimizer changes, batch size experiments. A human researcher makes one change, waits for a training run to finish, reads the results, thinks about what to try next, and repeats. It's slow, because humans need to eat, sleep, and attend meetings. Most of the wall-clock time in a research loop is the researcher not being at the keyboard.

autoresearch asks: what if the AI agent is the researcher? Give it a real training setup on a real GPU, let it modify the code, run the experiment, check if results improved, and decide what to try next — all by itself, all night long. You go to sleep; you wake up to a log of 100 experiments and (hopefully) a measurably better model.

This isn't a training framework. It's a research automation protocol. The AI agent is the researcher, the human is the research director, and the program.md file is how the director communicates strategy.

Actors & Goals

The Human (Research Director)

The AI Agent (Autonomous Researcher)

The Training Script (train.py)

The Evaluation Harness (prepare.py)

Operator Value

Core Capabilities

1. Autonomous Experiment Loop

The system runs a continuous cycle: propose a change → implement it → run the experiment → evaluate the result → keep or discard → repeat. This loop runs indefinitely until manually stopped. The agent never pauses to ask the human for guidance.

2. Fixed-Budget Fair Comparison

Every experiment trains for exactly 5 minutes of wall-clock time (excluding startup and compilation overhead). This makes experiments comparable regardless of what the agent changed — whether it doubled the model size, halved the batch size, or swapped the optimizer. The constraint is time, not steps or tokens.

3. Single-Metric Evaluation

All experiments are judged by one number: validation bits per byte (val_bpb). Lower is better. This metric is vocabulary-size-independent, so architectural changes that modify the tokenizer interaction or vocab dimensions are still fairly compared. The evaluation harness is fixed and cannot be modified.

4. Git-Based Experiment Management

Each research session runs on a dedicated git branch (e.g., autoresearch/mar5). The agent commits before each experiment, advances the branch when results improve, and reverts to the last-known-good commit when they don't. The branch tip always represents the best result so far.

5. Structured Result Logging

Every experiment — successful, failed, or crashed — is logged to a TSV file with the commit hash, val_bpb score, peak memory usage, keep/discard/crash status, and a human-readable description of what was tried. This file is intentionally kept out of git (untracked) to avoid merge noise.

6. Research Direction via Markdown

The human programs the agent by editing program.md — a plain markdown file that provides context, constraints, and strategy. This is the interface between human intent and agent behavior. The human never touches Python directly; they write prose.

7. Crash Recovery and Resilience

When an experiment crashes (out-of-memory, bugs, numerical instability), the agent diagnoses the failure, logs it, reverts, and moves on. Simple mistakes (typos, missing imports) are fixed and retried. Fundamentally broken ideas are abandoned gracefully. The loop continues.

8. Simplicity-Weighted Decision Making

The agent doesn't just chase lower numbers. A marginal improvement that adds ugly complexity is not worth keeping. A marginal improvement from deleting code is a clear win. Equal performance with simpler code is a keep. The system values elegance alongside performance.

Observable Behaviors

Starting a Research Session

Running an Experiment

Evaluating and Deciding

Data Preparation (One-Time)

Manual Baseline Run

Edge Cases

Non-Functional Constraints

Non-Goals

Suspected Implementation Leakage

The following observations are about how autoresearch is built rather than what it promises. They're noted here for completeness but belong in a technical spec:


Reconstructed by MacGyver from source code, documentation, and commit history. No claims about future behavior — this is what the system does today.