EOM (Economy of Minds) is a new multi-agent system from Harvard in which agents coordinate and improve over time through market-like mechanisms: auctions, payments, and wealth accumulation.
Such an environment has led to emergent multi-step reasoning and strong performance on several agentic tasks.
EOM is for developers building multi-agent systems to accomplish specific tasks, addressing limitations of hand-designed orchestration.
Most multi-agent stacks rely on hand-designed orchestration, where developers manually define explicit prompts and state machine graphs.
Long tasks require different role switches according to the state and progress of the task.
Systems should optimally switch system prompts for continuous task progress.
Given a task, EOM aims to generate an optimized population of agents, each with specific instructions on how and when to act.
EOM simulates a market system that externally controls how agents evolve.
The end result is a group of specialized agents and an intelligent routing mechanism to select how they solve a task.
Complex behaviors emerge automatically when simple agents optimize their actions around uncertainties posed by other agents.
This theory of behaviors organically emerging from multi-agent scenarios is not a new concept.
Older pre-LLM multi-agent works, such as the OpenAI Hide and Seek paper, indicated similar emergent behaviors.
To be clear, the paper is NOT training agents to be financially independent or to perform trades or auctions.
It introduces an algorithm for optimizing agents on common verifiable environments.
Target environments include Math, optimizing accelerator code, deep search, and scientific research.
For the most part, the agents don't even know they are inside this market simulator.
The market is an external system that controls which agents evolve and which do not.
Agents bid in the auction to win the right to take a step in one of these target environments.
Winning the auction deducts the bid amount from the agent's wallet and lets it visit the environment and take an action.
Future agents taking actions in the same environment pay their bid back to the previous agent (the last winner).
Over time, the wealthiest agents end up with the best policies to perform in the target environment.
This is a super interesting take on long-horizon credit assignment and evolutionary prompt optimization algorithms.
In EOM, an agent is not a separately trained neural network but essentially a prompted LLM policy.
Each agent is characterized by a prompt, a trigger condition, a frozen bid value, and a wealth variable.
A prompt, which is a system prompt or instruction template, defines the agent's 'role' and procedure.
This role changes depending on the target environment being optimized for.
For MATH tasks, roles assigned include planner, executor, and verifier.
For the accelerator design task, roles include historian, planner, and executor.
A trigger or 'wake-up' condition determines when an agent is eligible to bid in the auction.
A frozen bid value is used in auctions, fixed during initialization.
A wealth variable changes over time and drives agent selection and evolution.
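The four fields above map naturally onto a small record type. The following is a minimal sketch, not the paper's implementation: the class name `Agent`, its field names, and the example trigger rule are all hypothetical, and the wake-up condition is modeled as a plain Python predicate standing in for whatever prompt-based check the real system runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    """One population member: a prompted LLM policy plus its market state."""
    name: str
    prompt: str                      # system prompt defining role and procedure
    trigger: Callable[[str], bool]   # wake-up condition (stand-in for an LLM check)
    bid: float                       # frozen at initialization, never updated
    wealth: float                    # changes over time; drives selection

    def is_awake(self, observation: str) -> bool:
        """Return True if this agent is eligible to bid on this observation."""
        return self.trigger(observation)


# Hypothetical planner: wakes up only while no plan exists in the observation.
planner = Agent(
    name="planner",
    prompt="Break the problem into steps before solving.",
    trigger=lambda obs: "PLAN:" not in obs,
    bid=2.0,
    wealth=10.0,
)
```

Only `wealth` is meant to change after construction; the prompt, trigger, and bid together describe the fixed policy that selection acts on.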
EOM operates through two coupled loops: Planning within an episode and Adaptation across episodes.
Within an episode, agents auction for the right to act at each step, and wealth is updated via a bucket-brigade payment rule.
Across episodes, the population evolves prompts using exploration/exploitation driven purely by wealth.
The goal of EOM is a group of agents, each with its own system prompt and a policy of when to act.
Given a new problem, agents bid on who will act, perform the action, and repeat the process until the solution is reached.
This loop defines the within-episode dynamics, including agent bidding, action execution, and wealth transfer.
At each environment step, agents run a prompt to determine if they should 'wake up' and participate in the auction.
Woken agents automatically submit their frozen bids, which are fixed during initialization.
The agent with the highest bid wins the auction, immediately loses the bid amount, and gains control of the environment.
The winning agent samples an action in the target environment, advancing the state from s_t to s_{t+1}.
The environment transitions and produces a reward r_t.
Wealth transfer happens with bucket-brigade credit assignment, involving payments between agents and environment rewards.
The new winner pays its bid to the previous winner.
The new winner also collects the environment reward r_t into their wallet.
For the very first winner in an episode, payment goes to the 'house' instead of another agent.
The loop repeats on the updated environment, with agents waking up based on the latest observation.
If an agent goes bankrupt (wealth drops to zero or below), it is removed from the population.
If an agent sits on their wallet and declines participation, their wallet degrades over time, leading to bankruptcy.
The wealth degradation mechanism adds urgency to agent participation in the system.
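The within-episode steps described above can be sketched as a short simulation. This is an illustrative toy under stated assumptions, not the paper's code: `Bidder`, `run_episode`, and the `rent` parameter are hypothetical names, the environment is replaced by a fixed list of rewards, the wake-up check is a plain function of the step index, wealth decay is modeled as a flat per-step charge, and the episode simply ends if no agent wakes up.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Bidder:
    name: str
    bid: float                      # frozen bid
    wealth: float
    wakes: Callable[[int], bool]    # stand-in for the LLM wake-up check


def run_episode(agents: List[Bidder], rewards: List[float],
                rent: float = 0.0) -> Tuple[List[str], float]:
    """Play one episode; rewards[t] stands in for the environment's r_t."""
    alive = list(agents)
    house = 0.0
    prev: Optional[Bidder] = None
    winners: List[str] = []
    for t, r_t in enumerate(rewards):
        awake = [a for a in alive if a.wakes(t)]
        if not awake:
            break                                   # assumption: nobody acts, episode ends
        winner = max(awake, key=lambda a: a.bid)    # highest bid wins
        winner.wealth -= winner.bid                 # winner pays its bid immediately
        if prev is None:
            house += winner.bid                     # first payment goes to the house
        else:
            prev.wealth += winner.bid               # later payments go to the last winner
        winner.wealth += r_t                        # winner collects the reward
        prev = winner
        winners.append(winner.name)
        for a in alive:                             # assumed flat rent on every wallet
            a.wealth -= rent
        alive = [a for a in alive if a.wealth > 0]  # bankrupt agents are removed
    return winners, house


agents = [
    Bidder("planner", 3.0, 10.0, lambda t: t == 0),
    Bidder("executor", 2.0, 10.0, lambda t: t == 1),
    Bidder("verifier", 1.0, 10.0, lambda t: t == 2),
]
winners, house = run_episode(agents, rewards=[0.0, 0.0, 1.0])
print(winners, house, [a.wealth for a in agents])
```

In this run only the final step carries a reward, yet the planner and executor each recover most of their bid from the agent that acts after them.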
This method addresses the credit assignment problem common in environments without intermediate rewards.
The 'pay your bid to the last auction winner' rule provides a solution for long-horizon credit assignment.
The design decision has a key consequence related to the backward flow of value.
An agent can profit by moving the system into states where downstream agents are willing to 'pay their bid' to take over.
This mechanism becomes decentralized credit assignment across the trajectory.
If an action enables valuable future actions, later agents 'buy' the continuation via bids, rewarding the agent even without direct r_t.
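A small worked example makes the backward flow of value concrete. The helper below is a hypothetical illustration that assumes each step is won by a different agent: the winner of step t pays its own bid, collects r_t, and later receives the bid of whoever wins step t+1.

```python
from typing import List


def net_wealth_changes(bids: List[float], rewards: List[float]) -> List[float]:
    """Net wealth change for the winner of each step under pay-the-previous-winner."""
    n = len(bids)
    changes = []
    for t in range(n):
        paid = bids[t]                                  # own bid, paid on winning
        received = bids[t + 1] if t + 1 < n else 0.0    # next winner's bid, if any
        changes.append(-paid + rewards[t] + received)
    return changes


# Three steps, reward only at the end (no intermediate rewards).
changes = net_wealth_changes(bids=[1.0, 2.0, 3.0], rewards=[0.0, 0.0, 5.0])
print(changes)  # [1.0, 1.0, 2.0]
```

The first two agents never see an environment reward, but each ends up ahead because the next agent paid more to continue from the state they produced. The changes sum to the total reward minus the first bid, which went to the house.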
After episode rollouts finish, the population of agent policies is updated using economic selection and prompt mutation.
Low wealth agents are pruned out, and rich agents are mutated for the next round.
Low wealth agents either did not participate in the auction (too passive) or took actions leading to bad future states.
New agents are added until the population reaches size constraints, using two sources: exploitation and exploration.
Exploitation involves picking wealthy 'parent' agents and mutating their prompts slightly to produce children.
This preserves useful behaviors, amplifies successful strategies, and promotes specialization.
Exploration replaces bankrupt or weak agents with new variants.
New variants are created by amending prompts to correct failure modes or explore different behavior regions.
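The across-episode update can be sketched as a prune-then-refill step. Everything named here is an assumption for illustration: `Member`, `evolve`, `n_explore`, and `start_wealth` are hypothetical, the `mutate` and `amend` callables stand in for the LLM calls that would rewrite prompts, and letting a child inherit its source's bid and start from a fixed wealth is a simplification the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Member:
    prompt: str
    bid: float
    wealth: float


def evolve(population: List[Member], size: int,
           mutate: Callable[[str], str], amend: Callable[[str], str],
           n_explore: int = 1, start_wealth: float = 10.0) -> List[Member]:
    """Prune bankrupt members, then refill the population up to `size`."""
    survivors = sorted((m for m in population if m.wealth > 0),
                       key=lambda m: m.wealth, reverse=True)[:size]
    failed = [m for m in population if m.wealth <= 0]
    free = size - len(survivors)
    children: List[Member] = []
    # Exploration: amend the prompts of failed agents to correct failure modes.
    for m in failed[:min(n_explore, free)]:
        children.append(Member(amend(m.prompt), m.bid, start_wealth))
    # Exploitation: lightly mutate the wealthiest survivors, richest first.
    i = 0
    while survivors and len(survivors) + len(children) < size:
        parent = survivors[i % len(survivors)]
        children.append(Member(mutate(parent.prompt), parent.bid, start_wealth))
        i += 1
    return survivors + children


population = [Member("A", 3.0, 15.0), Member("B", 2.0, 4.0), Member("C", 1.0, -1.0)]
next_gen = evolve(population, size=4,
                  mutate=lambda p: p + " [mutated]",
                  amend=lambda p: p + " [amended]")
print([m.prompt for m in next_gen])
```

Here the bankrupt agent C is replaced by an amended variant, and the remaining free slot goes to a mutated child of A, the wealthiest survivor.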
EOM trains and ships a society of agents, not a single winner, with market simulation used only during training.
What is 'trained' and then 'shipped' to solve tasks is a society or population of agents.
Each agent has its own prompts and local 'when to act' logic.
At evaluation time, a thread-local copy of the trained population is used, with the wake-up policy selecting the acting agent.
During evaluation, the population is 'frozen', meaning no further training occurs.
Wallets and wealth transfer are train-time machinery only.
The one market element that carries over to inference is the bid, which decides which agent acts when several want to 'wake up'.
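At inference the routing reduces to a wake-up filter followed by a highest-bid pick, with no wallets involved. The sketch below assumes hypothetical names (`FrozenAgent`, `select_agent`) and uses string predicates as stand-ins for the learned wake-up policy.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class FrozenAgent(NamedTuple):
    name: str
    prompt: str
    trigger: Callable[[str], bool]   # stand-in for the learned wake-up policy
    bid: float                       # frozen bid, used only to break contention


def select_agent(population: Sequence[FrozenAgent],
                 observation: str) -> Optional[FrozenAgent]:
    """Pick the acting agent: filter by wake-up, then take the highest bid."""
    awake = [a for a in population if a.trigger(observation)]
    return max(awake, key=lambda a: a.bid) if awake else None


population = [
    FrozenAgent("planner", "Plan the solution.", lambda obs: "DRAFT" not in obs, 3.0),
    FrozenAgent("verifier", "Check the draft.", lambda obs: "DRAFT" in obs, 2.0),
    FrozenAgent("executor", "Carry out the next step.", lambda obs: True, 1.0),
]

first = select_agent(population, "Problem: solve x^2 = 4")
later = select_agent(population, "DRAFT: x = 2 or x = -2")
print(first.name, later.name)
```

No wealth is read or written here, which matches the frozen-population description: the bids act purely as a fixed priority ordering among the agents that wake up.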
The Accelerator Design task illustrates EOM's 'Economy of Minds' idea, showcasing role-specialized agents and wealth dynamics.
Agents are specialized into roles like Historian, Planner, and Executor for the accelerator design task.
The Historian summarizes previous trials and keeps memory of promising or failed directions.
The Planner proposes high-level search directions.
The Executor runs fine-grained local evaluations.
The environment reward is about improving EDP (energy-delay product) on GEMMINI ResNet-50 kernels, where lower EDP is better.
Each role-specialized agent carries wealth, which acts as a live scoreboard of usefulness as episodes progress.
Agents that help produce new best records accumulate wealth.
A periodic rent steadily penalizes everyone, causing mediocre agents to slowly die out.
Once wealth drops below zero, an agent goes bankrupt and is removed.
The richest agents spawn mutated 'good-birth' descendants (exploitation).
The weakest agents spawn amended 'bad-birth' descendants (exploration).
Across different kernels, market pressure automatically discovers which specialist lineage is actually valuable.
Sometimes Historian-style memory collapses due to inherited bias, or Planner lineages reproduce because search direction is the bottleneck.
Sometimes multiple roles co-exist because they are complementary.
Coordination and credit assignment emerge from simple incentives like wealth flow, rent, birth, and bankruptcy.
This mechanism produces an adaptive population without requiring a central control system.
The paper highlights several 'aha moments' or emerging behaviors, revealing how economic rules lead to self-organization.
For specific environments like MATH, agents are seeded with roles such as Planners, Executors, and Verifiers during initialization.
Planners likely bid early, while verifiers likely make bids after a draft solution is in place.
EOM doesn't hard-code workflows, instead setting up economic rules that lead to self-organizing behaviors resembling learned algorithms.
Performance improves because the economy selects useful action chains, reproduces them, and deletes agents that don’t contribute.
Coordination is an emergent property of selection, not an engineered protocol.
The system gets better at which sequences of agents act, meaning the interaction topology sharpens over time.
This behavior is similar to that observed in the OpenAI Hide-and-Seek paper.
On Finance-Agent-Bench, EOM performance dips early during exploration, later recovering and surpassing initial performance.
The early dip is due to exploration testing alternative specialists.
This is a 'market-like' phenomenon, where early turnover and reallocation temporarily hurt headline performance.
In accelerator design, useful lineages persist, spawn offspring, and dominate auctions, while failed variants go bankrupt.
The unit of learning is not one agent prompt, but an evolving family tree of prompts under wealth selection pressure.
On the hardest accelerator kernels, the society repeatedly converges on a specific tiling/dataflow motif without templates.
The system is not given the motif as a template, and reward is only 'EDP record-breaks' without specific labels.
The system learns a reusable design heuristic through selection.
In scientific research, prompts evolve into compact multi-step reasoning routines, with executors internalizing roles and adding self-checks.
An Executor internalizes what previously required other roles.
Mutations add increasingly explicit self-checks such as principle-first, symmetry checks, feasibility checks, and substitution to falsify.
An agent becomes less of a generic text generator and more like a procedural module that runs a learned scientific derivation routine.
In the CloudCast task, the economy selects different workflow shapes based on the workspace state, showing emergent resource-awareness.
CloudCast is an iterative code-optimization task where agents improve a Python program to minimize total data-transfer cost.
The economy selects different workflow shapes depending on whether the workspace is near a high score or uncertain/regressed.
| Workspace State | Workflow Shape |
|---|---|
| Near a high score | short 'read-edit-evaluate-commit' |
| Uncertain/Regressed | longer 'edit-build-evaluate' loops |
This is an emergent resource-awareness behavior, demonstrating a society-level policy of cautious versus aggressive action.