Agents Under the Executive

How Bezos's three-decisions-a-day principle should shape the design of AI agents in leadership functions.

Written by Gen Vagula · CEO & Co-Founder, Ampron OÜ · May 2026

01 The Principle Worth Carrying Forward

Bezos's three-decisions-a-day claim is not a productivity tip. It is a statement about what executive labour is for, and that statement is the right starting point for thinking about AI agents in leadership functions.

The argument, stripped to its core, is this. Executive judgement is scarce. The number of consequential, hard-to-reverse decisions an executive can make well in a day is small. Therefore the whole organisation should be designed to ration that scarce thing — to push routine decisions away from the executive, to insulate the executive's morning hours, to defer hard problems to when the executive is sharp, and to accept that most of the executive's day will produce no visible output because thinking does not produce visible output.

The principle has a corollary that Bezos states less explicitly but that matters more for what follows. If executive judgement is the scarce resource and routine decisions can be devolved, then the organisation needs a mechanism for devolution. Someone, or something, has to absorb the work that the executive should not be doing. In Amazon's case that mechanism was a culture of two-pizza teams, single-threaded leaders, and decision rights pushed deliberately downward.

That is where AI agents enter the picture. Not as a replacement for executive judgement. As the devolution mechanism. The thing that absorbs the work the executive should never have been doing in the first place, so that the executive's scarce attention is preserved for the small number of calls that warrant it.

The Right Framing

The goal of an AI executive agent is not to act like an executive. The goal is to absorb the work that lets a human executive finally do the three things a day that only the executive can do. The agents are infrastructure for human judgement, not substitutes for it.

Most of what gets written about executive AI agents loses this distinction. It treats the agent as a junior executive in training, to be promoted into broader authority as it demonstrates competence. That framing is wrong on its own terms, and it leads to design choices that produce a faster bureaucracy rather than a better one. The framing this memo will defend is the opposite. The agent's purpose is to take work off the executive's desk, not to take the desk itself.

02 The Scarcity Has Moved

Bezos rationed executive attention because executive attention was the bottleneck. With agents in the system, attention is no longer the scarce resource. The design has to follow the scarcity to wherever it went.

A human executive can think hard about a few things a day. An agent can produce a hundred reasoned outputs an hour and would happily produce a thousand if asked. The original Bezos rationing logic, applied directly to an agent, would be a category error. There is no reason to ration the agent's decision count. There is every reason to ask where the new scarcity lives, because something is always scarce in a working system.

Three resources become scarce in any agent-augmented executive function. Each one is worth naming because each one demands its own form of protection.

Human review capacity

If the agents produce decisions at machine speed and a human reviews the consequential ones, the human's review time becomes the bottleneck. This is the bottleneck most likely to fail silently, because the human will not stop reviewing — they will simply review more shallowly. The decisions go through. The audit trail looks complete. The review was nominal. By the time anyone notices that the human became a rubber stamp, the system has been operating without meaningful oversight for months.

Organisational trust

The right to deploy agents in consequential functions is granted by the organisation and is not unconditional. Each visible mistake by an agent withdraws some of that grant. Trust does not recover linearly: a single visible failure can cost more credibility than ten quiet successes earn. So the rate at which agents can take on consequential work is governed not by their capability but by the rate at which the organisation can absorb their visible failures without losing confidence in the entire programme.

Accountability

An agent cannot be fired, cannot lose reputation, cannot have its bonus clawed back. Whatever accountability the system carries has to land on humans, and there is a finite amount of accountability any human will accept on behalf of decisions they did not personally make. This is the most under-discussed scarcity and the one that, in practice, determines how far agent authority can be extended.

The Reframed Design Problem

The question is not how to make the agent decide well. The question is how to structure the system so that human review capacity, organisational trust, and personal accountability are not exhausted faster than they regenerate.

The implication is that every design choice about an agent should be evaluated against these three resources, not against the agent's apparent capability. An agent that produces excellent decisions but consumes more review time than the reviewer has is a net negative. An agent that performs well in testing but produces one visible embarrassment per quarter is a net negative. An agent that operates in a domain where no human will accept the consequence of its mistakes is a net negative. Capability is a necessary condition. It is not the design constraint.

03 Reversibility, Properly Defined

Bezos's distinction between one-way doors (Type 1 decisions, hard or impossible to reverse) and two-way doors (Type 2 decisions, reversible) is the right intellectual tool, but applying it to agent systems takes more care than it needs in human organisations.

For a human executive, the reversibility classification is usually obvious. Signing a long-term contract is irreversible. Choosing this week's marketing copy is reversible. The human can see, before acting, roughly what the consequence will be and whether it can be undone.

For an agent, the classification is harder for three reasons that have to be designed around.

The action and the consequence are not the same

An agent sending an email to a customer has taken a Type 2 action in mechanical terms — it is just an email. But the email cannot be unsent, the tone cannot be uncommitted, and the implied promise cannot be retracted. The reversibility of the act and the reversibility of its consequence are different questions, and the consequence is what matters. Any agent system that classifies actions by their mechanical reversibility rather than their consequential reversibility will systematically under-rate its own blast radius.

Volume changes the classification

A human making a hundred small Type 2 decisions a year can recover from a few being wrong. An agent making a hundred Type 2 decisions a day accumulates errors fast enough that, even at a low per-decision error rate, the total becomes a serious problem. A category that is genuinely Type 2 at human volumes can become functionally Type 1 at agent volumes, simply because the cumulative impact of being slightly wrong across thousands of instances becomes hard to walk back.

Chains compound

When one agent's output feeds another agent's input, the chain introduces a new failure mode. A subtle error in the first agent — a confident estimate that should have been a hedged hypothesis, a missing caveat, an assumption left unstated — propagates downstream with no awareness in any link of the chain that propagation has occurred. By the fifth handoff, the system is acting on a conclusion that no individual agent would have stood behind in isolation. Each step looks reversible. The chain is not.

Reversibility cannot be assessed at the level of the individual action. It has to be assessed at the level of the cumulative system, at full volume, with the chain of consumers downstream. Anything else is wishful classification.

The working rule that follows is straightforward to state and harder to apply. Before granting an agent autonomy on a class of decisions, ask: if a thousand of these decisions are made and ten percent are wrong, can the organisation walk back the cumulative consequence? If the answer is yes, the class is genuinely Type 2 and the agent can own it. If the answer is no — if the cumulative effect of a tolerable error rate would still be intolerable — then the class is Type 1 regardless of how Type 2 any individual instance looks. The agent's authority should not extend there.

The Classification Test

Reversibility is a property of the system, not of the act. Ask whether the cumulative consequence of a realistic error rate is reversible. If not, classify the category as Type 1 and keep it above the agent's authority line.
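The test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a calibrated model: the fields `cost_to_undo_one_error` and `unrecoverable_share`, the `walk_back_budget` parameter, and the example figures are all hypothetical names and numbers introduced here, measured in whatever unit the organisation uses for remediation effort.

```python
from dataclasses import dataclass


@dataclass
class DecisionClass:
    name: str
    decisions_per_period: int      # volume at full agent throughput
    error_rate: float              # realistic share of wrong decisions (0..1)
    cost_to_undo_one_error: float  # assumed effort to walk back one wrong decision
    unrecoverable_share: float     # assumed share of errors that cannot be undone at all


def classify(dc: DecisionClass, walk_back_budget: float) -> str:
    """Classify the whole category, not the single act."""
    expected_errors = dc.decisions_per_period * dc.error_rate
    # If at least one error per period is expected to be impossible to undo,
    # the category is a one-way door regardless of how cheap the rest are.
    if expected_errors * dc.unrecoverable_share >= 1:
        return "Type 1"
    cumulative_cost = expected_errors * dc.cost_to_undo_one_error
    return "Type 2" if cumulative_cost <= walk_back_budget else "Type 1"


# Hypothetical figures: a thousand decisions, ten percent wrong.
memo = DecisionClass("draft internal memo", 1000, 0.10, 0.5, 0.0)
email = DecisionClass("send customer email", 1000, 0.10, 0.5, 0.05)

for dc in (memo, email):
    print(dc.name, "->", classify(dc, walk_back_budget=80))
```

The design choice worth noting is that the function never looks at a single instance. Both example categories have the same per-error cost; the email category is pushed above the authority line only because some of its errors cannot be walked back at any price.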

04 Accountability Is The Ceiling

The reason executive AI agents are stuck at narrow authority is not that they lack capability. It is that no organisation has been willing to pre-commit to who carries the consequence of their mistakes.

Accountability does not come from documentation. It does not come from configuration. It comes from a specific human whose career, money, or reputation is at stake when something goes wrong. An agent has none of these. So when an agent makes a consequential bad call, the accountability must land on a human, and the question of which human is decided not by the system's governing documents but by what the organisation will actually tolerate when the mistake becomes visible.

In any real deployment, this question has three possible answers, and each one has a consequence the designer should be honest about.

The deploying executive carries it

The CEO or department head who put the agent into operation is on the hook for what it does, structurally similar to vicarious liability for an employee. This is the cleanest answer and the one most likely to produce healthy behaviour, because the deploying executive will rationally constrain the agent to a scope they can stand behind. The cost is that the agent's authority will be tightly bounded — anything outside the executive's comfort zone for personal accountability will be excluded.

The team carries it collectively

Everyone touched by the agent's output shares some responsibility. This sounds modern and is, in practice, equivalent to no one being accountable. Distributed accountability dissolves on contact with a bad outcome. The organisation learns nothing from the failure because no one is positioned to learn from it. This is the answer that emerges by default when no one explicitly chooses, and it is the worst of the three.

The agent's authority stays narrow enough that the question rarely arises

The agent only acts in domains where the worst plausible mistake is small enough that the deploying executive can absorb it without political consequence. This is the safest answer, the least ambitious answer, and the one most real deployments converge on regardless of what the original design document said.

The third answer is the one this memo recommends, not because it is exciting, but because it is the only one that survives contact with an organisation under stress. Agents earn broader authority by accumulating a visible track record of decisions made well at narrow scope, not by being granted broader authority upfront on the strength of their apparent capability. The expansion is gradual, evidence-based, and reversible. Anything else is unearned trust, and unearned trust is the resource that collapses fastest when something goes wrong.

The Diagnostic to Run

Before deploying an executive agent, write down the name of the person who will explain the agent's worst plausible mistake to the board or the customer. If you cannot name that person, the agent is not ready. If the named person would not accept the role, the agent's scope is too broad.

This is the framing that breaks the most ambitious visions of executive AI. The agent that operates with CEO-level authority is not blocked by the model. It is blocked by the absence of any human willing to pre-commit to bearing the consequence of its mistakes. That commitment has to be made before the authority is granted, and most organisations will not make it. So the agent stays narrow, and the broader vision waits.

05 Compounding Error in Agent Chains

The most underappreciated failure mode in multi-agent systems is the inflation of confidence at each handoff. The design has to fight it deliberately because no agent in the chain can see it from inside.

In a human organisation, mistakes get caught by friction. Someone questions an assumption in a meeting. A colleague re-reads a draft overnight. A subordinate pushes back on a number that does not feel right. None of this is a formal review process; it is the natural texture of working with other humans, and it dampens cascading errors before they propagate.

In an agent system, the texture is missing. One agent's output goes into the next agent's input without anyone reading it. A wrong assumption made at step one becomes the foundation of fifty downstream decisions, none of which know they are downstream of an assumption. By the time a human looks at the final output, the original error is buried under layers of confident-sounding analysis that all rest on it.

The mechanism is worth being precise about, because the fix depends on understanding it. Each agent, by default, produces output that is slightly more confident than its input warranted. The first agent says "based on these three data points, the figure is probably around fifteen percent." The second agent, summarising for the third, says "the figure is around fifteen percent." The third agent, acting on it, says "the figure is fifteen percent." By the fourth handoff, the figure has the rhetorical weight of an established fact. Nobody downstream remembers it was an estimate. Nobody can see the chain from where they stand.

Each layer of agent-to-agent handoff inflates confidence by a small amount. Five layers in, the system is acting on figures that should be hedged hypotheses but are being treated as facts. No agent in the chain can see this from the inside.
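A toy model makes the drift easier to see. The sketch below assumes, purely for illustration, that each handoff closes a fixed fraction of the remaining gap between the stated confidence and certainty. The `inflation` parameter and the starting figure of 0.6 are invented; nothing here is measured from a real deployment.

```python
def stated_confidence(warranted: float, handoffs: int, inflation: float) -> float:
    """Confidence as it reads after a number of agent-to-agent handoffs.

    `warranted` is the confidence the original evidence supports (0..1).
    `inflation` is the assumed fraction of the remaining gap to certainty
    that each handoff silently closes.
    """
    stated = warranted
    for _ in range(handoffs):
        stated += inflation * (1.0 - stated)
    return stated


# The evidence never changes, so the warranted confidence stays at 0.6
# while the stated confidence climbs with every handoff.
for n in range(6):
    print(n, round(stated_confidence(0.6, n, inflation=0.2), 3))
```

Under these assumptions an estimate that deserved 0.6 reads as roughly 0.87 after five handoffs. The exact curve does not matter; what matters is that the gap between stated and warranted confidence only ever widens unless something in the design pushes back.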

The design response is to put the friction back in deliberately. Four mechanisms are worth considering, listed in rough order of how much friction they introduce relative to how much they cost the system.

Preserve confidence framing across handoffs

An agent consuming another agent's output must not strip its hedges. If the input was "probably around fifteen percent, based on three data points, with the assumption that the market continues to grow," the output must carry the same uncertainty forward. This is enforced through instruction and through audit, and it is the cheapest fix to implement.
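One way to make this auditable is to carry the hedges as structured fields rather than as prose that a summariser can quietly drop. The sketch below is an assumption about how that might look; `Claim`, `hand_off`, and `hedges_stripped` are hypothetical names, not part of any existing framework.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Claim:
    statement: str
    confidence: str                    # e.g. "probably", "established"
    basis: str                         # what the claim rests on
    assumptions: tuple[str, ...] = ()  # conditions the claim depends on


def hand_off(claim: Claim, new_statement: str) -> Claim:
    """Rephrase a claim for the next agent while carrying its hedges forward."""
    return replace(claim, statement=new_statement)


def hedges_stripped(upstream: Claim, downstream: Claim) -> list[str]:
    """Audit check: list every hedge the downstream claim dropped or altered."""
    problems = []
    if downstream.confidence != upstream.confidence:
        problems.append("confidence changed")
    if downstream.basis != upstream.basis:
        problems.append("basis dropped or changed")
    if set(upstream.assumptions) - set(downstream.assumptions):
        problems.append("assumptions dropped")
    return problems


first = Claim(
    statement="The figure is probably around fifteen percent.",
    confidence="probably",
    basis="three data points",
    assumptions=("the market continues to grow",),
)
faithful = hand_off(first, "Growth is probably around fifteen percent.")
inflated = Claim("The figure is fifteen percent.", "established", "")

print(hedges_stripped(first, faithful))
print(hedges_stripped(first, inflated))
```

The instruction to the agent does the first half of the work; a check like `hedges_stripped`, run over sampled handoffs, does the audit half.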

Insert human checkpoints at compounding nodes

Identify the points where many downstream decisions depend on a single upstream output. Those are the high-leverage error points. A human review at the compounding node is worth more than ten reviews at the leaves, because the upstream error, if caught, prevents all the downstream errors that would have inherited it.
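Finding the compounding nodes is a graph question: for each output, count how many other outputs depend on it directly or indirectly. The sketch below assumes the dependencies are available as a simple mapping from producer to consumers. The graph contents and the threshold of three are hypothetical.

```python
def downstream_count(graph: dict[str, list[str]], node: str) -> int:
    """Count every distinct output that depends, directly or indirectly, on `node`."""
    seen: set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return len(seen)


def compounding_nodes(graph: dict[str, list[str]], threshold: int) -> list[str]:
    """Outputs with at least `threshold` downstream dependents."""
    return sorted(n for n in graph if downstream_count(graph, n) >= threshold)


# Hypothetical dependency map: producer -> outputs that consume it.
DEPENDENCIES = {
    "market_estimate": ["revenue_forecast", "pricing_memo", "hiring_plan"],
    "revenue_forecast": ["budget", "board_update"],
    "pricing_memo": [],
    "hiring_plan": [],
    "budget": [],
    "board_update": [],
}

print(compounding_nodes(DEPENDENCIES, threshold=3))
```

In this example a single review of `market_estimate` covers five downstream outputs, which is the arithmetic behind placing the human there rather than at the leaves.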

Cap chain depth for consequential decisions

The number of agent-to-agent handoffs before a human review should be small — two, perhaps three. Beyond that, the human is no longer reviewing a decision; they are auditing a process they cannot reconstruct. Depth is the enemy of meaningful review.
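A depth cap is simple to enforce mechanically if every work item carries a counter. The following is a minimal sketch; `MAX_HANDOFFS`, `WorkItem`, and the function names are hypothetical, and the cap of two is simply the lower of the values suggested above.

```python
from dataclasses import dataclass

MAX_HANDOFFS = 2  # assumed cap; adjust per decision category


@dataclass(frozen=True)
class WorkItem:
    payload: str
    handoffs_since_review: int = 0


class HumanReviewRequired(Exception):
    """Raised when an item must go to a human before any further agent handoff."""


def hand_to_next_agent(item: WorkItem) -> WorkItem:
    # Refuse the handoff once the chain is as deep as a human can still reconstruct.
    if item.handoffs_since_review >= MAX_HANDOFFS:
        raise HumanReviewRequired(item.payload)
    return WorkItem(item.payload, item.handoffs_since_review + 1)


def record_human_review(item: WorkItem) -> WorkItem:
    # A completed human review resets the depth counter.
    return WorkItem(item.payload, 0)


item = WorkItem("Q3 pricing recommendation")
item = hand_to_next_agent(item)
item = hand_to_next_agent(item)
try:
    hand_to_next_agent(item)
except HumanReviewRequired as blocked:
    print("needs human review:", blocked)
```

Raising an exception rather than logging a warning is deliberate in this sketch: the chain stops until a human has actually looked.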

Build a red-team agent into the system

One agent's job is to attack the weakest assumption in the chain. Not to make decisions. To force the rest of the system to defend its confidence. This is the structural equivalent of the colleague who pushes back at the meeting, and in well-designed systems it catches things no individual reviewer would have caught.

The Hidden Tax

Every layer of agent-to-agent handoff carries a hidden confidence inflation tax. The tax is invisible per step and visible only in the final output. Without deliberate friction inserted by design, the system is paying it whether or not anyone has noticed.

06 Doctrine That Gets Used, Not Doctrine That Decorates

An agent system needs governing documents, but the volume of documentation is inversely correlated with its effective force. Four short documents applied with discipline will govern an agent better than fourteen long ones that no one re-reads.

The temptation, when designing an agent system, is to formalise everything. A constitution document. A principles document. A decision framework document. An escalation policy. A knowledge scope document. A quality rubric. A meeting standard. A tone guide. Each is defensible in isolation. Taken together, they constitute the compliance manual of a large enterprise, and they will be treated the way compliance manuals are always treated: written once, filed, and not consulted at decision time.

The test for whether a document is doctrine or decoration is simple. Is it actually consulted at the moment of decision? If yes, it is doctrine. If no, it is decoration, and its existence creates false confidence in the system's governance because it looks like the governance question has been handled when it has not been.

The minimum document set that does the job, in my view, is four. Anything beyond this should justify itself against the same test.

One constitution, fitting on one screen

Five or six principles in the imperative. What the agents exist to do. What they must never do. What they must always escalate. What the organisation cares about more than speed. If the constitution is longer than one screen, it is not the lens the agents are actually filtered through. The brevity is not stylistic. It is functional. Short documents get read; long ones get skimmed.

One authority map per agent

What this agent can decide on its own. What it must consult on. What it must escalate. This is the only document that needs to be specific to each agent, and it should fit on a single page. It is reviewed quarterly. It is changed when wrong. It is the document the agent consults before acting on anything ambiguous.

One reversibility lookup

A table of action categories with their classification. Sending an email to a customer: Type 1 effect, requires review. Drafting an internal memo: Type 2, agent-owned. Updating an internal record: depends on which record. This is the single most important doctrine artefact because it is the one that gets consulted before every consequential action. It should be specific, exhaustive within scope, and updated as new categories appear.

One audit trail standard

Every consequential agent decision produces a record: what was decided, what was considered, what confidence was held, what inputs were used. Not because anyone reads it routinely. Because when something goes wrong, the audit trail is the only mechanism by which the organisation can reconstruct what happened. Without it, the post-mortem is theatre. With it, the system learns.
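A record with exactly those four fields might look like the sketch below. The `agent` and `recorded_at` fields are additions assumed here for reconstruction purposes, and the example values are invented; the serialisation uses only the Python standard library.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class AuditRecord:
    agent: str                # assumed field: which agent decided
    decided: str              # what was decided
    considered: list[str]     # what was considered
    confidence: str           # what confidence was held
    inputs: list[str]         # what inputs were used
    recorded_at: str = field(default_factory=_utc_now)  # assumed field


def to_json_line(record: AuditRecord) -> str:
    """Serialise one record as a single JSON line, ready to append to a log."""
    return json.dumps(asdict(record), sort_keys=True)


example = AuditRecord(
    agent="scheduling-agent",
    decided="moved the supplier call to Thursday",
    considered=["keep Tuesday slot", "move to Thursday"],
    confidence="high; both parties confirmed availability",
    inputs=["calendar export", "supplier email of Monday"],
)
print(to_json_line(example))
```

One line per decision, appended and never edited, is enough for a post-mortem to replay what the agent knew and how sure it claimed to be.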

The Leanness Principle

A doctrine document is only valuable if it is consulted at decision time. Documents that exist but are not consulted are not doctrine — they are decoration that produces false confidence in the system's governance. Fewer documents, used harder, will outperform more documents used softly.

The instinct to formalise is correct. The volume usually is not. A founder running a fifty-person company who builds the seven-document agent governance stack of a Fortune 500 has not built better governance. They have built the appearance of better governance, which is worse than admitting the governance is informal because at least the informal version is honest about what it is.

07 The Working Model for a Small Company

Strip away the corporate-officer metaphor. For a founder running a small company, the right deployment is narrower, more concrete, and less ambitious than the grand vision suggests.

Most writing about executive AI agents imagines an executive team. A CEO agent, a COO agent, a CFO agent, a chief of staff agent, all coordinating like a board. For a fifty-person company, this is the wrong target. The relevant question is not "do I need an executive team of agents." It is "what categories of decision are happening every day in my company, which of them are reversible, and which of them can I devolve to an agent without losing accountability."

A small company has a clearer view of this than a large one. The founder can see the decisions. They are visible in the inbox, in the standup, in the customer pipeline. Cataloguing them is a week's work, not a quarter's.

Step one: catalogue the decisions

For one working week, write down every decision that crosses your desk. Note its type (reversible or irreversible) and how often it recurs. By the end of the week, the catalogue will show a clear pattern: a handful of recurring Type 2 categories consuming most of the day, and a small number of Type 1 decisions getting under-served because the Type 2 volume is crowding them out. This is the same problem Bezos described, at a smaller scale and with the categories visible by name.

Step two: identify the agent-eligible categories

Agent-eligible categories are those where three conditions hold. The action is genuinely reversible at the consequence level, not just mechanically. The cumulative effect of a tolerable error rate is something the organisation can absorb without political consequence. The audit trail can be reconstructed if anything goes wrong. Only categories meeting all three conditions are eligible. Most categories will fail at least one of them and stay with humans.
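The week's catalogue and the three-condition filter fit in a short script. The sketch below is illustrative only: the category names, frequencies, and the yes/no judgements are hypothetical, and in practice each boolean is a human judgement call rather than something the code can compute.

```python
from dataclasses import dataclass


@dataclass
class DecisionCategory:
    name: str
    times_per_week: int
    consequence_reversible: bool       # reversible in consequence, not just mechanically
    cumulative_error_absorbable: bool  # a tolerable error rate stays absorbable at volume
    audit_trail_reconstructable: bool  # what happened can be rebuilt afterwards

    def agent_eligible(self) -> bool:
        # All three conditions must hold; failing any one keeps the work with humans.
        return (
            self.consequence_reversible
            and self.cumulative_error_absorbable
            and self.audit_trail_reconstructable
        )


# Hypothetical catalogue from one working week.
catalogue = [
    DecisionCategory("triage inbound support tickets", 120, True, True, True),
    DecisionCategory("reply to customer pricing questions", 40, False, False, True),
    DecisionCategory("approve supplier contract terms", 2, False, False, True),
]

# Most frequent eligible categories first: these free the most attention.
eligible = sorted(
    (c for c in catalogue if c.agent_eligible()),
    key=lambda c: c.times_per_week,
    reverse=True,
)
print([c.name for c in eligible])
```

Sorting by frequency is a design choice in this sketch: the first narrow deployment should target the eligible category that is crowding out the most Type 1 attention.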

Step three: deploy narrow, not broad

One agent. One category. One human reviewer. Not a CEO-agent. Not an executive team. A specific worker on a specific recurring task, with clear authority, a clear handoff, and a named human reviewer who has accepted the accountability for its mistakes. Less ambitious than the grand vision. More likely to produce a working deployment that survives its first visible mistake.

Step four: protect the capacity that opens up

This is the step most likely to be skipped, and skipping it dissolves the value of the entire exercise. The point of pushing Type 2 work to agents is not to do more Type 2 work. It is to free up Type 1 attention. If the time freed by the agent gets immediately refilled with more Type 2 work that the founder takes back on because it now feels manageable, nothing has been gained. The discipline is to leave the recaptured time empty, and to spend it on the small number of Type 1 decisions the company actually needs from the executive.

The Translated Principle

Bezos said the executive should make a small number of high-quality decisions a day. Agents do not change that. They change which decisions reach the executive. Designed correctly, the executive ends up doing exactly what Bezos described — and the agents do everything else, narrowly, with audit trails, and with their authority deliberately under-extended.

This is the deployment most likely to work, and the one least likely to be photographed for the case study. The case study version always involves an organisation chart of agents acting like executives. The version that actually runs is one narrow agent at a time, each one absorbing a specific recurring category of work, each one bounded by an explicit named reviewer, each one extending its scope only after a visible track record of decisions made well within the boundary.

08 What to Take Away

Six conclusions, in descending order of how confident I am in each.

One: agents are infrastructure for human judgement, not substitutes for it

This is the most confident claim in the memo. The right framing of an executive agent is not as a junior executive being trained for broader authority, but as a piece of infrastructure that absorbs the work an executive should not be doing. Treating agents as infrastructure produces sober design choices. Treating them as proto-executives produces grandiose ones.

Two: accountability is the ceiling, not capability

The reason executive agents stay narrow is not that they cannot do more. It is that no one is willing to pre-commit to bearing the consequence of their broader mistakes. Until the organisation answers that question explicitly, the agent's authority should not exceed the scope its named reviewer will personally stand behind.

Three: the scarcity has moved

Bezos rationed executive attention because executive attention was the bottleneck. With agents, the bottleneck is no longer there. The scarce resources become human review capacity, organisational trust, and personal accountability. Any design that does not name and protect these new scarcities will exhaust them silently.

Four: reversibility is a property of the system, not the act

Individual agent actions can look reversible while their cumulative effect, at full volume, through a chain of downstream consumers, is irreversible. Classification has to happen at the system level. The act-level classification systematically under-rates blast radius.

Five: compounding error is the underappreciated failure mode

Multi-agent systems inflate confidence at every handoff. Without deliberate friction inserted by design — preserved hedging, capped chain depth, red-team agents, checkpoints at compounding nodes — a five-step chain produces confidently wrong conclusions no agent in the chain can see from inside.

Six: doctrine is what gets consulted at decision time

Four short documents used hard will govern an agent system better than fourteen long ones used softly. The volume of documentation is inversely correlated with its effective force. Doctrine that exists but is not consulted is decoration, and it produces false confidence in the system's governance.

The right standard for an executive AI agent is not that it acts like an executive. It is that it absorbs the work executives should never have been doing in the first place, so that the executive can finally do the three things a day that only the executive can do.

That is the translation of Bezos's principle worth carrying into agent design. Not three decisions a day for the agent. Three decisions a day for the human, made possible by the agents underneath. The scarce thing is, and remains, human judgement applied at the points where it actually matters. Everything else in the system is in service of protecting it.

References

Jeff Bezos, Forum on Leadership interview, George W. Bush Presidential Center, April 2018. Source of the "three good decisions a day" framing.
Jeff Bezos, 2015 Letter to Amazon Shareholders. Type 1 and Type 2 decision framework.
Jeff Bezos, 2016 Letter to Amazon Shareholders. The seventy percent information heuristic.

Quoted phrasing is reconstructed from memory of published sources and should be verified against original transcripts before external citation. The claims about agent system behaviour are working hypotheses to be tested against deployment experience.