← Blog

The Hard Pocket Problem

The IRIS framework produces a clear, measurable improvement in coordination outcomes — ToMCoordScore rising from 0.11 (baseline) to 0.38 (best candidate at 140k episodes), collision rate dropping from 0.34 to 0.11. These are the headline numbers.

What the framework also surfaces — and this is perhaps its more important contribution — is the region where improvement does not happen.

Under operational pressure — high urgency, narrow safety margins, throughput-biased norms — the agent's policy is pushed toward regretful tradeoffs. It has correctly identified the partner as cooperative, but the situation demands assertiveness. It has correctly identified the partner as assertive, but the situation demands yielding. The belief state is accurate. The context is read correctly. The two conflict, and the agent makes a decision that produces a suboptimal outcome.

We call this the hard pocket: a bounded region of the coordination space where better inference does not translate into better outcomes, because the inference and the correct action are in tension.

The hard pocket is not a bug. It is an empirical discovery — a measurable operational constraint on what Theory of Mind can achieve in multi-agent coordination under partial observability. It mirrors a recognised problem in agentic AI deployment (Shapira et al., 2026): systems that perform well under standard conditions degrade under the specific combination of time pressure, conflicting demands, and incomplete information.

The practical significance is that a well-characterised hard pocket is more useful than an unexamined claim of general coordination ability. An evaluation framework that reports where it does not work alongside where it does enables deployment decisions that account for operational boundaries. The IRIS hard pocket taxonomy — urgency-pressure regime, margin-narrow regime, norm-conflict regime — provides exactly this boundary specification.

Phase 1 of the IRIS development roadmap tests whether the hard pocket is architectural (specific to GRU-based learners) or fundamental (an information-theoretic limit on coordination under partial observability). Both outcomes are publishable and practically useful.