ARENA · RESEARCH NOTE · 2026 · 06 · 10 MIN READ

Building Arena: A Lab for Agent-Run Organisations.

Most AI demos still happen in chat windows. Arena is a different kind of test environment — one where agents are evaluated inside synthetic organisations with budgets, roles, constraints, and consequences that unfold over time.

Research preview — Arena is an active research environment, not a shipped product. This post describes directions we are exploring, not commitments or available services.

A model is given a prompt. It answers. We judge the answer. Sometimes we wire it to tools and call it an agent. But organisations do not work like chat windows. They have budgets, roles, priorities, constraints, deadlines, limited information, dependencies, and consequences that unfold over time.

Arena is Redoubt Labs' research lab for evaluating agentic enterprise decision-making on top of EADS. It exists to test whether agents can act usefully under organisational constraints before anyone trusts them near real enterprise workflows. EADS provides the simulation engine — topology modelling, disruption cascades, and causal tracing. Arena extends that into a persistent organisational economy where agents compete for roles, make hiring decisions, manage budgets, and face scenario pressure across thousands of ticks.

The project started from a simple question:

"What happens when AI agents are not just answering questions, but operating inside an organisation?" (Arena design thesis)

Not as one-off assistants. Not as static personas. As decision-makers inside a simulated company, city, regulator, supplier network, or market. Agents with roles. NPC baselines. Scenario pressure. Budgets. Hiring decisions. Operational constraints. External shocks. Consequences that accumulate across ticks.

That is the direction of Arena: a deterministic simulation lab for testing agent-run organisations.

01 · Why this matters

The harness is the experiment

The more I work with agentic systems, the less convinced I am that ordinary model benchmarks tell us enough.

A model can be excellent at reasoning in isolation and still fail inside a system because the environment does not expose the right information, the action space is poorly designed, the feedback loop is delayed, or the scoring function rewards the wrong behaviour.

That has been one of the strongest lessons from building Arena so far.

Early experiments made it tempting to say: "the LLM agents are too conservative" or "the NPCs are better strategists." But that was the wrong conclusion. The better question was:

"Did the harness actually let the model behave differently?" (Arena diagnostic principle)

In several cases, the answer was no.

Some decision paths were too narrow. Some actions were accepted but did not materially affect the simulated economy. Some observations hid the information the agent needed. Some prompts pushed agents toward safe deferral. And some scoring choices may have rewarded activity rather than judgement.

That is the uncomfortable but useful lesson: before you compare agents, you have to prove the world is capable of measuring the difference.

02 · What Arena is

Design principles

Arena is built around a few design principles.

First, the simulation should be deterministic wherever possible. If the same world, same seed, and same accepted decisions are replayed, the outcome should be reproducible. That gives us a way to separate model behaviour from simulation noise.
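As a rough illustration of what that replay guarantee means in practice, here is a minimal sketch. It is not Arena's implementation: `ReplayWorld`, `replay`, the budget arithmetic, and the shock range are all hypothetical names and numbers invented for this example. The only point it makes is that a private seeded RNG plus an ordered list of accepted decisions is enough to reproduce a run, and that a hash of the final state makes the check cheap.

```python
import hashlib
import json
import random


class ReplayWorld:
    """Toy world (hypothetical): one budget, moved by seeded shocks and decisions."""

    def __init__(self, seed: int, budget: int = 1_000) -> None:
        self.seed = seed
        self.budget = budget
        self.tick = 0
        # A private RNG instance keeps the run independent of global random state.
        self.rng = random.Random(seed)

    def step(self, spend: int) -> None:
        shock = self.rng.randint(-50, 50)  # seeded external shock
        self.budget += shock - spend
        self.tick += 1

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so the hash depends only on the state.
        state = {"seed": self.seed, "budget": self.budget, "tick": self.tick}
        payload = json.dumps(state, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()


def replay(seed: int, decisions: list[int]) -> str:
    """Replay accepted decisions in order and return the final state hash."""
    world = ReplayWorld(seed)
    for spend in decisions:
        world.step(spend)
    return world.fingerprint()


if __name__ == "__main__":
    first = replay(seed=7, decisions=[10, 20, 30])
    second = replay(seed=7, decisions=[10, 20, 30])
    print("reproducible:", first == second)
```

If two replays with the same seed and decisions ever produce different fingerprints, the difference comes from the simulation, not from the model, which is exactly the separation this principle is after.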

Second, agents should submit intentions; the world should decide what is valid. Agents never mutate state directly. They propose actions, and the simulation validates, rejects, resolves, and records the consequences.
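One way to picture the intent contract is the sketch below. It is an assumption-laden toy rather than Arena's actual action surface: `Intent`, `Resolution`, `IntentWorld`, the two action names, and the rejection reasons are all made up for illustration. What it shows is the shape of the idea: the agent only ever hands over a proposal, and the world alone changes state and logs every outcome, including rejections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """A proposed action. Frozen so it cannot be altered after submission."""
    agent_id: str
    action: str   # hypothetical action names: "hire" or "spend"
    amount: int


@dataclass
class Resolution:
    """What the world decided about an intent, and why."""
    intent: Intent
    accepted: bool
    reason: str


class IntentWorld:
    ALLOWED_ACTIONS = {"hire", "spend"}

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.log: list[Resolution] = []

    def submit(self, intent: Intent) -> Resolution:
        # Validate first; state changes only on the accepted branch.
        if intent.action not in self.ALLOWED_ACTIONS:
            result = Resolution(intent, False, "unknown action")
        elif intent.amount <= 0:
            result = Resolution(intent, False, "amount must be positive")
        elif intent.amount > self.budget:
            result = Resolution(intent, False, "insufficient budget")
        else:
            self.budget -= intent.amount
            result = Resolution(intent, True, "ok")
        # Rejections are recorded too, so the log covers every proposal.
        self.log.append(result)
        return result


if __name__ == "__main__":
    world = IntentWorld(budget=100)
    print(world.submit(Intent("agent-1", "hire", 60)))
    print(world.submit(Intent("agent-1", "spend", 60)))
    print("remaining budget:", world.budget)
```

Keeping rejected intents in the log matters for diagnosis: a run where most proposals bounce off validation says something about the action surface, not only about the agent.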

Third, NPCs matter. A zero-cost baseline is not a placeholder; it is the control arm. If a paid LLM agent cannot beat a simple NPC under the same conditions, the result is important. But it only means something if the substrate gives both agents real causal leverage.
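A cheap way to probe for causal leverage, sketched here under invented assumptions, is to run two policies that are different by construction and confirm the world scores them differently on at least some seeds. The toy economy, the `run` and `has_leverage` helpers, and both policies are hypothetical and exist only to show the check itself.

```python
import random
from typing import Callable

# A policy maps (tick, remaining budget) to a spend amount. Hypothetical contract.
Policy = Callable[[int, int], int]


def run(policy: Policy, seed: int, ticks: int = 20) -> int:
    """Toy economy: spending produces output only up to that tick's demand."""
    rng = random.Random(seed)
    budget, output = 100, 0
    for tick in range(ticks):
        demand = rng.randint(0, 10)
        # Clamp the proposal so the world, not the policy, enforces the budget.
        spend = max(0, min(policy(tick, budget), budget))
        budget -= spend
        output += min(spend, demand)
    return output


def has_leverage(a: Policy, b: Policy, seeds: range) -> bool:
    """True if the two policies score differently on at least one seed."""
    return any(run(a, seed) != run(b, seed) for seed in seeds)


def idle(tick: int, budget: int) -> int:
    return 0   # never acts


def steady(tick: int, budget: int) -> int:
    return 5   # always proposes the same spend


if __name__ == "__main__":
    print("world can tell them apart:", has_leverage(idle, steady, range(5)))
```

If a check like this returns False for policies that clearly should differ, comparing an LLM agent against an NPC in that world would tell us nothing yet.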

Fourth, negative results should be preserved. Arena now treats experiments, diagnostics, baseline runs, and failed hypotheses as durable research artefacts. A null result is not a waste if it tells us that the harness, action surface, or economy model needs to change.

[Figure: Arena tick pipeline. Scenario pressure flows through leadership planning, hiring, work execution, and review, with feedback into the next tick. Every tick captures a full provenance chain. Diagram labels: scenario (pressure), leadership (planning), hiring (matching), work (execution), review (scoring); seed, scenario, topology; decisions, outcomes, scores; heartbeat records the full provenance chain.]

03 · What has been built

The core pieces

Arena now has the core pieces needed for serious experimentation.

The most important progress has not been a flashy model demo. It has been learning how to avoid fooling ourselves.

A simulation can run successfully and still fail as an experiment. A model can emit valid decisions and still have no material effect. A benchmark can look scientific while mostly measuring its own scaffolding.

Arena's current work is about closing that gap.

04 · The current frontier

One organisation, one shock, one scorecard

The next milestone is not "try a bigger model."

The next milestone is proving that a small, controlled organisational scenario can separate decision quality from noise.

That means building a simple model gym: one organisation, one shock, one baseline, one scorecard. The goal is to compare NPCs, local models, hosted models, and structured agent teams under the same deterministic conditions.
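To make "same deterministic conditions" concrete, here is a small sketch of a paired comparison. Everything in it is hypothetical: the inventory-style episode, the prices, the shock timing, and the `Scorecard` fields are stand-ins chosen for brevity, not the Arena scenario or scoring model. The technique it shows is pairing: candidate and baseline run on identical seeds, so any difference in score comes from the decisions rather than from the draw.

```python
import random
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# A policy maps observed stock to units ordered. Hypothetical action contract.
Policy = Callable[[int], int]


def run_episode(policy: Policy, seed: int, ticks: int = 30, shock_tick: int = 10) -> int:
    """One organisation, one shock: demand doubles from `shock_tick` onward."""
    rng = random.Random(seed)
    stock, profit = 20, 0
    for tick in range(ticks):
        order = max(0, policy(stock))
        stock += order
        profit -= 2 * order                      # unit cost (illustrative)
        demand = rng.randint(3, 7) * (2 if tick >= shock_tick else 1)
        sold = min(stock, demand)
        stock -= sold
        profit += 5 * sold                       # unit price (illustrative)
    return profit


@dataclass
class Scorecard:
    mean_diff: float   # candidate minus baseline, averaged over seeds
    wins: int          # seeds where the candidate scored strictly higher
    seeds: int


def compare(candidate: Policy, baseline: Policy, seeds: range) -> Scorecard:
    # Paired design: both policies see the same seed, hence the same demand draws.
    diffs = [run_episode(candidate, s) - run_episode(baseline, s) for s in seeds]
    return Scorecard(mean(diffs), sum(d > 0 for d in diffs), len(diffs))


def fixed_order(stock: int) -> int:
    return 5                     # NPC-style baseline: same order every tick


def restock_to_target(stock: int) -> int:
    return max(0, 12 - stock)    # candidate: top up to a target level


if __name__ == "__main__":
    print(compare(restock_to_target, fixed_order, range(20)))
```

Because the demand draws never depend on the policy's actions in this toy, both arms face the same sequence of demand on each seed. A real scenario would need the same property checked explicitly before the scorecard can be trusted.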

The first version does not need to expose the full Arena world. It needs to answer one narrower question:

"Can this agent operate the organisation better than the baseline when the rules, costs, scenario, and action contract are held constant?" (Arena evaluation principle)

If the answer is no, that is useful. If the answer is yes, we can scale the benchmark carefully. Either way, the result should be earned.

05 · Why I am writing about this

Research notes, not hype

I want to build more of Arena in public, but not as a hype cycle.

The aim of this blog series is to share the research engineering lessons: the mistakes, the diagnostics, the operational surprises, and the design patterns that might help others building long-running agent systems.

Some posts will be about simulation design. Some will be about LLM orchestration. Some will be about operational resilience. Some will be about what happens when external agents try to use an API designed for autonomous clients rather than humans.

The internal implementation will stay private where it should. But the lessons are worth sharing.

The big one so far is this:

"If you want to evaluate AI agents, do not start by trusting the agent. Start by auditing the world you put it in." (Arena design principle)