Sixteen Thousand Parameters

2026-06-11

architecture
efficiency
evaluation

The IRIS framework runs at approximately 16,000 parameters: a single GRU cell with 64 hidden units, a belief head, a partner-action prediction head, and a policy head. That is fewer parameters than a single convolutional layer in a typical production vision model.
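As a rough check on that figure, the parameter count can be tallied by hand. The sketch below assumes a GRU cell laid out like PyTorch's `torch.nn.GRUCell` (three gates, two bias vectors) and plain linear heads. The 64 hidden units come from the description above; `OBS_DIM`, `BELIEF_DIM`, and `N_ACTIONS` are hypothetical widths chosen for illustration, not the framework's actual dimensions.

```python
def gru_cell_params(input_dim: int, hidden_dim: int) -> int:
    """Parameter count for a GRU cell in the torch.nn.GRUCell layout (bias=True)."""
    # Three gates (reset, update, candidate). Each gate has an input weight
    # matrix, a hidden weight matrix, and two bias vectors.
    return 3 * hidden_dim * (input_dim + hidden_dim + 2)


def linear_params(in_dim: int, out_dim: int) -> int:
    """Parameter count for a dense layer with bias."""
    return (in_dim + 1) * out_dim


HIDDEN = 64      # stated in the post
OBS_DIM = 12     # assumption: observation width is not given in the post
BELIEF_DIM = 4   # assumption: belief head output width
N_ACTIONS = 5    # assumption: size of the action space

counts = {
    "gru_cell": gru_cell_params(OBS_DIM, HIDDEN),
    "belief_head": linear_params(HIDDEN, BELIEF_DIM),
    "partner_action_head": linear_params(HIDDEN, N_ACTIONS),
    "policy_head": linear_params(HIDDEN, N_ACTIONS),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>20}: {n:>6,}")
print(f"{'total':>20}: {total:>6,}")
```

With these assumed widths the total comes to 15,886, and the GRU cell accounts for roughly 94% of it. The heads are close to free; the recurrent core is where the budget goes.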

This is not a limitation. It is a methodological commitment.

The current AI evaluation landscape is dominated by benchmarks that only labs with frontier-scale infrastructure can run. A researcher testing a new coordination architecture must either have access to a GPU cluster or rely on results published by labs that do. This creates a structural bias toward compute-intensive approaches and away from the kinds of questions that can be answered with careful, small-scale experimentation.

A 16,000-parameter architecture changes the terms. Training runs complete in hours on consumer hardware. Full experimental sweeps — five seeds, two horizons, multiple conditions — cost pennies in cloud compute. Every result is reproducible on a laptop. Every finding can be challenged, extended, or falsified by any researcher with a PyTorch install and an afternoon.
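To give a sense of how small such a sweep is, here is a minimal sketch that enumerates the grid. The five seeds and two horizons match the description above; the horizon labels and condition names are placeholders, since the actual values are not given here.

```python
from itertools import product

SEEDS = range(5)                       # five seeds, as described in the post
HORIZONS = ("short", "long")           # placeholder labels for the two horizons
CONDITIONS = ("baseline", "ablated")   # hypothetical condition names


def sweep_configs(seeds=SEEDS, horizons=HORIZONS, conditions=CONDITIONS):
    """Return one config dict per (condition, horizon, seed) combination."""
    return [
        {"condition": c, "horizon": h, "seed": s}
        for c, h, s in product(conditions, horizons, seeds)
    ]


configs = sweep_configs()
print(f"{len(configs)} training runs in the sweep")
```

Under these assumptions the full grid is 20 runs, and each added condition costs only ten more.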

The question the framework is designed to answer is this: under what conditions does improved inference produce better coordination? Answering it does not require a 70-billion-parameter model. It requires a clean experimental design, a fixed evaluation protocol, and a metric that penalises false safety as heavily as it rewards correct prediction. These are problems of engineering and experimental method, not of scale.

Lightweight architectures also surface problems that larger models can mask. The ~0.38 ToMCoordScore ceiling that appears across both benchmark variants is visible because the architecture is small enough that the ceiling cannot be blamed on insufficient capacity. If a 16,000-parameter model plateaus, the plateau is informational or structural, not computational. That is a useful thing to know.