AGI Adoption Stages

October 27, 2025

The next decade will not be defined by a single “AGI moment,” but by a stepwise transfer of agency from humans to machines. What changes is not the raw capability curve — that is already visible — but the locus of control. Each stage moves one layer of cognition, planning, and execution out of human hands and into machine autonomy, while humans migrate upward into governance, rule-setting, and exception-handling.

In the early stages, humans remain explicit operators. AI systems act as high-bandwidth executors and planners, but only inside the shape the human provides. Specification, approval, and responsibility remain in the human domain; AI functions as an extension of the operator’s will.

As systems mature, the bottleneck moves from “what the AI can do” to “how we control what it does.” AI begins to propose plans, revise them mid-flight, and act with partial autonomy. Humans no longer instruct every step — they control the envelope within which steps are allowed to happen. Oversight becomes exception-based rather than continuous.

Later, as performance, verification, and constraint-compliance mature, AI becomes outcome-bound rather than step-bound. Humans define the ends and the red lines; AI finds the means. The role of the human tilts from instructing to arbitrating — they intervene only when the system escalates, not to continuously steer execution.

In still later stages, the human ceases to manage work and instead manages the rules of work. The human function becomes constitutional: to set the normative, legal, ethical, and safety conditions under which AI is allowed to operate. AI becomes the executor of reality; humans become the authors of constraint environments.

At the final stage, humans specify intent — not method, not plan, not constraints. “This is what must become true.” The machine owns the conversion from intent to strategy to execution to audit, while humans retain sovereignty only at the level of legitimacy, not mechanism.

This trajectory is not optional — it follows from the economics of scale, the speed advantage of autonomous decision loops, and the eventual impossibility of keeping humans in every loop without destroying the value of autonomy. When systems act faster than humans can supervise, governance replaces micromanagement as the only coherent control instrument.

The central question therefore shifts from “What can AGI do?” to “At each rung of the autonomy ladder, what remains the non-automatable human function?” The answer is consistent across domains: when machines take over doing, humans must rise to governing — or become irrelevant to the work they once performed.


Summary

Stage 1 — Explicit Instructor

Logic of the stage
AI is treated as a deterministic power-tool. The human specifies not only the desired output but the methodology, constraints, and intermediate structure. The AI is not allowed to reinterpret intentions or optimize — only to execute faithfully.

What must exist / be true for this stage to work

  • Human instructions are explicit, unambiguous, and checkable.

  • Execution is reversible (rollbacks, drafts, sandboxes).

  • Tool use is safe and contained.

  • Output is inspected before being accepted.

Architectural primitives implied

  • RAG for grounding (no hallucinated claims)

  • ReAct or function-calling for tool execution

  • Policy filters & safety guardrails on IO

  • Immutable logging of tool calls and outputs

  • Human approval gate for finalization
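
As a minimal sketch of the last primitive above, the human approval gate: nothing counts as final until an operator explicitly accepts the artifact. The produce_draft stub and the console prompt are illustrative placeholders, not any particular framework.

# Minimal human approval gate: the AI produces a draft, a human accepts or rejects.
# produce_draft() is a placeholder for whatever the model generates at Stage 1.

def produce_draft(task: str) -> str:
    return f"[draft output for: {task}]"          # stand-in for model output

def approval_gate(task: str):
    draft = produce_draft(task)
    print("--- DRAFT ---\n" + draft)
    verdict = input("Accept this artifact? [y/N] ").strip().lower()
    if verdict == "y":
        return draft                              # only now does the artifact count as final
    print("Rejected; nothing is merged, shipped, or sent.")
    return None

if __name__ == "__main__":
    approval_gate("summarize Q3 incident reports")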


Stage 2 — Co-Planner with Human Primacy

Logic of the stage
Humans stop hand-specifying methods; they specify goals and constraints. The AI now proposes structured decompositions and strategies. But humans retain total control over which plan is adopted.

What must exist / be true

  • The AI can reason in structures, not only in prose.

  • Multiple strategies can be generated and compared.

  • Plans must be self-justifying (cite evidence, state assumptions).

  • No execution begins without human plan acceptance.

Architectural primitives implied

  • Tree-of-Thoughts / deliberative search for multi-plan generation

  • Reflexion/critic loops for self-revision before presenting to humans

  • Retrieval-anchored planning (citations supporting each branch)

  • Constitutional filters checking plans against constraints

  • Versioned storage of rejected vs approved plans
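
One possible shape for the versioned plan store mentioned above: every candidate plan is kept with its status and rationale, so rejected plans remain part of the audit trail. The PlanStore/PlanRecord names and fields are assumptions for illustration.

# Versioned store of candidate plans: approved and rejected plans are both kept,
# each with a rationale, so the decision history stays auditable.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanRecord:
    version: int
    plan: str
    status: str          # "proposed" | "approved" | "rejected"
    rationale: str = ""

@dataclass
class PlanStore:
    records: List[PlanRecord] = field(default_factory=list)

    def propose(self, plan: str) -> PlanRecord:
        rec = PlanRecord(version=len(self.records) + 1, plan=plan, status="proposed")
        self.records.append(rec)
        return rec

    def decide(self, version: int, approved: bool, rationale: str) -> None:
        rec = self.records[version - 1]
        rec.status = "approved" if approved else "rejected"
        rec.rationale = rationale

store = PlanStore()
store.propose("Plan A: migrate the database first, then the API")
store.decide(1, approved=False, rationale="violates the zero-downtime constraint")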


Stage 3 — Delegated Execution under Constraints

Logic of the stage
The human approves a plan once; the AI may then execute autonomously within a predefined constraint envelope (budget, policies, forbidden actions) and must escalate only when boundaries are threatened.

What must exist / be true

  • Constraints are clear, machine-checkable, enforceable at runtime.

  • The AI can act without supervision while staying inside the envelope.

  • Uncertainty/violation leads to halting or escalation.

  • Every action is logged and reproducible.

Architectural primitives implied

  • Planner–Executor split with constraint enforcement

  • Sandboxed tool environments and allow-lists

  • Uncertainty detection & abstention routing

  • Immutable action logs + evidence traces

  • Human-on-exception, not human-on-every-step
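
A toy sketch of the abstention-routing primitive above: the agent acts autonomously only when its calibrated confidence clears a threshold, and otherwise escalates and waits. The threshold value and example actions are invented.

# Abstention routing: act autonomously above a confidence threshold,
# escalate to a human (and halt) below it -- human-on-exception, not on-every-step.

CONFIDENCE_THRESHOLD = 0.85   # illustrative value; set per risk appetite

def route(action: str, confidence: float) -> str:
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"EXECUTE: {action}"
    return f"ESCALATE: {action} (confidence={confidence:.2f}, waiting for human)"

print(route("update pricing table", 0.93))      # executes
print(route("delete customer records", 0.41))   # escalates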


Stage 4 — Self-Improving Executor with Oversight

Logic of the stage
The AI may not only execute the accepted plan but also revise it when reality contradicts prior assumptions; revisions must be justified and approved before adoption.

What must exist / be true

  • The AI can monitor the adequacy of its own plan.

  • Plan revisions are treated as proposals needing governance.

  • Self-critique is internal before escalation.

  • Revisions are reversible and auditable.

Architectural primitives implied

  • Actor–Critic–Editor (ACE) loops with justification channel

  • Verifier-gated plan modifications

  • State + reasoning logs for rollback/comparison

  • Change-impact estimation before switching

  • Policy fences remain binding during revision
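
A small sketch of verifier-gated plan modification: a revision is adopted only if it carries a justification and still passes the binding constraint checks. The constraints and the plan fields here are illustrative assumptions.

# Verifier-gated plan revision: a change is adopted only if it is justified
# and still satisfies every binding constraint; otherwise it is rejected.

CONSTRAINTS = [
    ("stays within budget",     lambda plan: plan["cost"] <= 10_000),
    ("no external data egress", lambda plan: not plan["sends_data_externally"]),
]

def review_revision(revised_plan: dict, justification: str) -> bool:
    if not justification.strip():
        return False                       # silent rewrites are not allowed
    for name, check in CONSTRAINTS:
        if not check(revised_plan):
            print(f"revision rejected: violates '{name}'")
            return False
    print("revision accepted:", justification)
    return True

review_revision(
    {"cost": 8_500, "sends_data_externally": False},
    "vendor API latency doubled; switching to batch processing",
)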


Stage 5 — Outcome-Bound Autonomy

Logic of the stage
Humans no longer approve plans. They specify outcomes and red-lines, and the AI is free to determine means, adapt strategies, and coordinate sub-agents — provided it stays within guardrails and escalates only on conflict/uncertainty.

What must exist / be true

  • Outcomes are expressible as measurable goals.

  • Guardrails are enforceable at runtime (not post-hoc).

  • The system can replan on its own without losing compliance.

  • Accountability survives free-form autonomy.

Architectural primitives implied

  • Constrained RL / Safe MPC (optimize with hard limits)

  • Uncertainty gating for high-risk or low-confidence states

  • Multi-agent orchestration with shared memory

  • Constitutional checks embedded in inference path

  • Decision dossiers (what, why, alternatives, risks)
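
A minimal shape for the decision-dossier primitive: what was chosen, why, which alternatives were considered, and which risks were accepted. Field names and the example content are illustrative.

# Decision dossier: a structured record attached to every major autonomous decision.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DecisionDossier:
    what: str                      # the action or strategy chosen
    why: str                       # evidence and rationale
    alternatives: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)

dossier = DecisionDossier(
    what="shift ad spend to channel B",
    why="channel B conversion 2.1x higher over the last 14 days",
    alternatives=["keep current split", "pause spend and re-test"],
    risks=["14-day window may be seasonal"],
)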


Stage 6 — Institutional Governor, Not Operator

Logic of the stage
Humans stop managing work; they manage the rules of work. They author and update constitutions, escalation logic, and legitimacy criteria. The AI operates continuously under these governance contracts.

What must exist / be true

  • Norms, not humans, must constrain action at run-time.

  • Agents must self-audit and expose reasons to inspectors.

  • Escalation is triggered by policy, not by human vigilance.

  • Legibility becomes a condition of autonomy.

Architectural primitives implied

  • Constitutional AI applied at inference time

  • Parallel verifiers (safety, legal, compliance) gating execution

  • Immutable audit fabric with replay and proof obligations

  • Escalation routers driven by policy triggers

  • Separation of powers (planner ≠ verifier ≠ executor)
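
A toy version of the audit fabric with replay: an append-only, hash-chained log whose integrity can be re-verified later. A real deployment would sit on a proper ledger or WORM store; this only shows the chaining idea.

# Append-only, hash-chained audit log: each entry commits to the previous one,
# so tampering with history is detectable on replay.
import hashlib, json

class AuditLog:
    def __init__(self):
        self.entries = []

    def append(self, event: dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        payload = json.dumps(event, sort_keys=True)
        entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": prev_hash, "hash": entry_hash})

    def verify(self) -> bool:                      # "replay" the chain
        prev_hash = "genesis"
        for e in self.entries:
            payload = json.dumps(e["event"], sort_keys=True)
            expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
            if e["prev"] != prev_hash or e["hash"] != expected:
                return False
            prev_hash = e["hash"]
        return True

log = AuditLog()
log.append({"actor": "planner", "action": "proposed plan v3"})
log.append({"actor": "executor", "action": "ran migration step 1"})
print(log.verify())   # True unless an entry was altered after the fact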


Stage 7 — Wish-Level Intent Specification

Logic of the stage
Humans express only “what reality should become,” not how to achieve it or how to constrain it stepwise. The AI translates wishes into governed goals and acts end-to-end.

What must exist / be true

  • Intent can be converted into machine-interpretable goals.

  • Ambiguity triggers abstention, not improvisation.

  • Constitutions outrank efficiency and remain binding.

  • Full-chain accountability (intent → means → outcome) is preserved.

Architectural primitives implied

  • Intent-to-goal inference with uncertainty margins

  • Holistic planning/execution/repair cycles under constitutions

  • Persistent normative memory (precedent-based resolution)

  • Verifiable causal dossiers for every major decision

  • Final sovereignty at the level of rules, not operations
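
A toy sketch of precedent-based resolution: a new intent is matched against past rulings, and the system abstains when no precedent is close enough. Real systems would use embeddings and richer case records; the precedents and word-overlap similarity here are invented for illustration.

# Persistent normative memory: resolve a new intent against past rulings (precedents)
# and abstain when nothing sufficiently similar exists.

PRECEDENTS = [
    ("share anonymized usage metrics with partners", "allowed with aggregation >= 1000 users"),
    ("contact users who opted out of marketing", "forbidden under consent policy"),
]

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def resolve(intent: str, min_similarity: float = 0.3):
    best = max(PRECEDENTS, key=lambda p: jaccard(intent, p[0]))
    if jaccard(intent, best[0]) < min_similarity:
        return "ABSTAIN: no sufficiently similar precedent; escalate to human governors"
    return f"apply precedent '{best[0]}' -> {best[1]}"

print(resolve("share anonymized metrics with a new partner"))
print(resolve("deploy autonomous pricing in a new market"))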


The Stages

Stage 1 — Explicit Instructor

Description

Humans specify exactly what to do and how to do it; the AI executes within those instructions without reinterpretation.
The AI may fill local gaps and call tools, but only inside the user’s declared frame.
All outputs remain subject to human approval; autonomy is bounded and reversible.
This stage treats AI as a powerful executor — not a planner, not a governor.

Assignment for the AGI

  • Execute precise instructions exactly as written (no goal re-interpretation).

  • Fill gaps tactically (generate code/tests/snippets/outlines) while preserving the user’s stated structure and constraints.

  • Use tools on demand (search, calculator, code runner, data loader) and attach evidence (citations, logs, diffs).

  • Ask only blocking questions when instructions are genuinely underspecified (otherwise proceed).

  • Return artifacts in ready-to-use form (PRs, formatted docs, datasets, scripts), plus a short “what I did/what I assumed” note.

Assignment for the human

  • Specify the task and acceptance criteria (inputs, outputs, constraints, done-ness checks).

  • Provide sources and boundaries (approved docs/corpora, style guides, repos, data).

  • Choose orchestration level (draft-only vs. draft+run tests vs. draft+run tools).

  • Review/approve outputs, and amend specs if the result reveals missing requirements.

  • Own sign-off & risk: humans are the operators; the AGI is a power tool.

Capabilities the system must have (Stage-1 scope)

  • Robust instruction following with clear constraint honoring.

  • Grounded retrieval (attach/quote sources; avoid hallucination).

  • Safe tool use (sandboxed execution, timeouts, resource/permission limits).

  • Lightweight planning (task decomposition) without changing the user’s objective.

  • Basic uncertainty handling (calibrated confidence + abstain/ask mechanisms).

  • Provenance and diffs (trace every claim/change to its source or test).

Architectures we'll need (pulled from the AGI architecture stack; a short sketch follows the list)

  • LLM + Retrieval (RAG) as the default backbone for factual tasks.

  • Reason–Act interleaving (ReAct) so the model can call tools, read observations, and continue.

  • Short-term working memory (scratchpad for intermediate steps; ephemeral by default).

  • Policy/guard layers (input/output filters, prompt-injection defenses, PII/DLP checks).

  • Verifier plug-ins (unit tests, static analyzers, linters, citation checkers) on the execution path.

  • Audit bus (immutable logs of prompts, tool calls, files touched, and evidence used).
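
To make the ReAct and audit-bus primitives above concrete, here is a minimal Thought → Action → Observation loop over allow-listed tools, with every call recorded. The "thoughts" are scripted stand-ins for what a model would generate; the tools and prices are invented.

# Minimal ReAct-style loop: Thought -> Action (allow-listed tool) -> Observation,
# with every tool call recorded on an audit trail.

ALLOWED_TOOLS = {
    "search": lambda q: f"[top result for '{q}']",
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy sandbox
}
AUDIT_TRAIL = []

def call_tool(name, arg):
    if name not in ALLOWED_TOOLS:                      # least-privilege: allow-list only
        raise PermissionError(f"tool '{name}' is not allow-listed")
    result = ALLOWED_TOOLS[name](arg)
    AUDIT_TRAIL.append({"tool": name, "arg": arg, "result": result})
    return result

scripted_steps = [                                     # an LLM would generate these
    ("look up current API pricing", "search", "API pricing per 1k calls"),
    ("multiply unit price by expected volume", "calculator", "0.002 * 1_500_000"),
]

for thought, tool, arg in scripted_steps:
    observation = call_tool(tool, arg)
    print(f"Thought: {thought}\nAction: {tool}({arg})\nObservation: {observation}\n")

print("audit trail:", AUDIT_TRAIL)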

System of control (focus)

  • Human-in-the-loop gates: nothing merges, ships, or emails customers without human sign-off.

  • Least-privilege tool sandbox: allow-listed tools, read-only by default; credential vaulting; network egress rules.

  • Abstention & escalation: if confidence < threshold or constraints conflict, stop and ask.

  • Deterministic environments: per-task containers with pinned deps; reproducible seeds; timeouts and quotas.

  • Evidence-by-design: every output cites sources, shows diffs/tests, and records decisions for audit.

  • Red-team inputs: prompt-injection detection on retrieved pages and tool outputs before use.

  • Kill switches: operator can halt jobs, roll back artifacts, and revoke tokens instantly.


Closest papers / algorithms / architectures that get us to Stage 1

  1. InstructGPT / RLHF — baseline for faithful instruction following; aligns models to comply with user intent and tone while avoiding unsafe behavior.

  2. DPO (Direct Preference Optimization) — simpler, stable alignment method (no explicit reward model/RL loop) for following instructions and preferences.

  3. RAG (Retrieval-Augmented Generation) — grounds answers in approved corpora with citations; key to provenance and freshness in Stage 1.

  4. ReAct (Reason + Act) — scaffolds the loop: Thought → Action (tool) → Observation → Thought; enables stepwise tool use with traceability.

  5. Toolformer / function-calling paradigms — models learn when/how to call calculators, search, code interpreters, etc., with arguments and result fusion.

  6. Self-Consistency & Tree-of-Thoughts (inference-time reasoning) — improve reliability on multi-step problems without changing objectives; pair well with verifiers.

  7. Uncertainty & OOD baselines (Deep Ensembles / MC-Dropout) — practical calibration so the system knows when it doesn’t know and can abstain/escalate.

Nice add-ons for dev teams:

  • RETRO for parameter-efficient, retrieval-heavy knowledge tasks.

  • Static analysis + unit-test generation as verifier modules (e.g., property-based tests, mutation testing) directly wired into the loop.

  • Safety stacks (Constitutional AI / policy classifiers) to keep outputs and tool calls within organizational norms.


Stage 2 — Co-Planner with Human Primacy

Description

Humans no longer dictate step-by-step execution — they define the problem space, constraints, and goals, and the AI proposes structured solutions.
The AI engages in decomposition, trade-off analysis, and alternative plan generation, but the human approves the plan before execution.
Autonomy is still conditional and revocable — the AI does not change goals, only proposes plans to reach them.
The human is still the sovereign decision-maker; the AI becomes a planning partner.


Assignment for the AGI

  • Produce multiple candidate decompositions and justify trade-offs (cost, speed, risk, reversibility).

  • Expose unknowns explicitly and request clarifications instead of assuming.

  • Link each sub-plan step to evidence or rationale from retrieval/tool calls.

  • Maintain internal consistency between goals, constraints, and sub-steps.

  • Stop before execution unless a plan is explicitly accepted.


Assignment for the human

  • State the goal, boundaries, and any unacceptable regions (budget, risk, ethics, policies).

  • Evaluate and select or edit AI-proposed plans; reject reasoning shortcuts.

  • Clarify ambiguities rather than delegate them implicitly.

  • Decide when a plan is sufficiently specified to authorize execution.

  • Remain responsible for direction, not mechanics.


Capabilities required at Stage 2

  • Structured task decomposition (hierarchical reasoning with explicit rationales).

  • Trade-off evaluation and alternative generation (not just single-path planning).

  • Evidence-grounded planning (retrieval/tool-backed rationales).

  • Basic model of constraints and forbidden actions.

  • Reliability under uncertainty via abstention and clarification prompts.


Architectures needed (mapped to the AGI architecture stack; a short sketch follows the list)

  • Deliberative skeletons (Tree-of-Thoughts / multi-path search) to produce alternative plans.

  • Retrieval-anchored reasoning to justify branches with citations.

  • Planner–critic loop so the AI can refine plans after self-evaluation.

  • Guard/constitution layer to enforce constraints before proposing plans.

  • Memory of design history (why a plan was rejected, what constraints were binding).
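
A highly simplified sketch of the planner-critic pattern listed above: several candidate decompositions are generated, scored against declared constraints, and presented ranked for human approval; nothing executes. The candidate plans, cost cap, and scoring rule are invented.

# Planner-critic loop in miniature: propose several plans, score them against
# constraints, and present the ranked list for human approval -- never auto-execute.

def generate_candidate_plans(goal: str):
    # Stand-in for Tree-of-Thoughts-style branching; a real system would query the model.
    return [
        {"name": "big-bang rewrite", "steps": 3, "risk": 0.8, "cost": 90},
        {"name": "strangler migration", "steps": 7, "risk": 0.3, "cost": 60},
        {"name": "do nothing", "steps": 0, "risk": 0.1, "cost": 0},
    ]

def critic(plan, max_cost=80):
    if plan["cost"] > max_cost:                        # constraint check before ranking
        return None
    return 1.0 - plan["risk"]                          # toy score: lower risk is better

def propose(goal: str):
    scored = [(critic(p), p) for p in generate_candidate_plans(goal)]
    scored = [(s, p) for s, p in scored if s is not None]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    for score, plan in scored:
        print(f"{plan['name']}: score={score:.2f}, steps={plan['steps']}, cost={plan['cost']}")
    print("awaiting human selection; no execution has started")

propose("migrate the legacy billing system")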


System of control

  • Human approval gate over plans — no execution without explicit confirmation.

  • Plan provenance — every sub-step traced to evidence or assumption.

  • Conflict detectors — block plans that violate declared constraints or policies.

  • Abstention clauses — require escalation when ambiguity or risk exceeds threshold.

  • Immutable record of all candidate plans, rejections, and rationales for audit.


Closest papers / methods / architectures enabling Stage 2

  1. Tree of Thoughts / Deliberate Decoding — structured branching search enabling alternative plan proposals rather than single-shot answers.

  2. Self-Consistency — agreement across multiple reasoning paths to reduce the failure modes of a single, possibly hallucinated chain.

  3. ReAct + Retrieval — interleaving reasoning with evidence and tool outcomes during planning, not after execution.

  4. Reflexion / Critic-of-self loops — self-evaluation before presenting output to the user.

  5. Constitutional AI / Policy Guardrails — plan-level constraint checking, not only output filtering.

  6. Process-supervision approaches — rewarding or training on good intermediate reasoning, not only end results.

  7. RAG with provenance logging — grounding plan rationales in traceable sources.


Stage 3 — Delegated Execution under Human Constraints

Description

The AI is no longer only a planner — it is allowed to execute the approved plan autonomously, but only inside an explicit constraint envelope set by the human.
Execution is bounded: the AI may act, call tools, modify artifacts, and iterate — but must escalate if constraints are threatened or uncertainty rises.
Human oversight becomes exception-based rather than step-based: the human intervenes only when the system flags a deviation or risk.
This stage produces real work output with reduced human micro-management, but still under tight authorization.


Assignment for the AGI

  • Execute the accepted plan without deviating from constraints (budget, scope, APIs, safety rules, policy).

  • Call tools, run code, retrieve sources, write commits, or generate drafts as needed without re-approving every step.

  • Monitor for violations, surprises, or low-confidence states and stop or escalate accordingly.

  • Produce verifiable artifacts (diffs, evidence, logs, tests) for all work done.

  • Maintain a live status of progress and remaining uncertainties.


Assignment for the human

  • Define the constraint envelope clearly (allowable actions, forbidden regions, resource caps, stop conditions).

  • Approve the plan once; then supervise by exception rather than step-by-step.

  • Review escalations, refine constraints when needed, and re-authorize execution.

  • Audit the produced artifacts and sign off on completion or continuation.

  • Remain accountable for boundary design, not for intermediate actions.


Capabilities required at Stage 3

  • Reliable tool-use execution across code, data, systems, and documents with safety wrappers.

  • Constraint-consistent behavior — honoring budgets, compliance, and policy rules mid-run.

  • Uncertainty detection & escalation — do not continue when confidence collapses.

  • Incremental provenance — record each action with evidence and rationale.

  • Self-monitoring — detect drift from plan or constraints without human prompting.


Architectures needed

  • Planner → Executor split with constraint checking (two-layer agent or meta-controller).

  • Runtime policy enforcement (guard models, allow-lists, sandboxed execution, DLP).

  • Error & anomaly monitors for tool outputs, data shifts, and policy violations.

  • Stateful memory/logging of execution trajectory for post-hoc audit and rollback.

  • Escalation logic coupled to uncertainty/conflict thresholds.
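
A compact sketch of the planner-executor split under a constraint envelope: approved steps run autonomously, but each is checked against the budget and action allow-list, and the first violation halts execution and escalates. Step names, costs, and the envelope values are illustrative.

# Executor bound by a constraint envelope: approved steps run autonomously,
# but each one is checked against the envelope; violations halt execution and escalate.

ENVELOPE = {"max_cost": 100.0, "allowed_actions": {"read", "transform", "write_staging"}}

approved_plan = [
    {"action": "read",       "cost": 10.0},
    {"action": "transform",  "cost": 25.0},
    {"action": "write_prod", "cost": 5.0},   # not in the allow-list -> escalation
]

def execute(plan, envelope):
    spent = 0.0
    for step in plan:
        if step["action"] not in envelope["allowed_actions"]:
            return f"ESCALATE: action '{step['action']}' outside envelope; halting"
        if spent + step["cost"] > envelope["max_cost"]:
            return "ESCALATE: budget cap would be exceeded; halting"
        spent += step["cost"]
        print(f"executed {step['action']} (running cost {spent})")
    return "plan completed inside the envelope"

print(execute(approved_plan, ENVELOPE))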


System of control

  • Constraint-first governance — autonomy is conditional, not absolute.

  • Human veto on escalation — agent stops and waits on boundary violation.

  • Immutable action log with evidence for forensic and contractual accountability.

  • Kill-switches / rollback integrated at execution level.

  • Dual-key actions for any high-risk step (AI proposes, human co-signs).


Closest papers / architectures / algorithms enabling Stage 3

  1. ReAct + Toolformer — practical scaffolding for autonomous multi-step tool execution.

  2. RETRO / RAG-verified action selection — retrieval-grounded decisions during execution.

  3. Reflexion / Verifier-in-the-loop — self-critique during execution phases.

  4. Safe RL / Constrained RL — optimization under hard constraints rather than reward-only.

  5. Deep Ensembles / MC-Dropout for abstention — escalation when uncertain.

  6. Policy/Guard stacks (Constitutional AI, DLP, allow-lists) as execution-time gates.

  7. CI/CD-integrated agent frameworks — agent commits gated by tests/static analyzers.


Stage 4 — Self-Improving Executor with Oversight

Description

The AI not only executes a human-approved plan under constraints — it is now permitted to revise, optimize, or replace parts of the plan during execution when new evidence or performance signals justify it.
The human no longer dictates the path; they supervise the governance of change, not the change itself.
The AI must provide justified deltas, showing why a different approach is superior and safe before switching.
Execution becomes adaptive rather than static, but still subject to reversal and audit.


Assignment for the AGI

  • Execute the plan while monitoring for better alternatives or failures of assumptions.

  • Propose plan modifications with explicit justification (evidence, metrics, counterfactuals).

  • Do not self-rewrite silently: changes must be logged with rationale and constraint checks.

  • Maintain continuous uncertainty monitoring and escalate if the safety envelope is threatened.

  • Produce incrementally verifiable artifacts and maintain an audit trail of both actions and reasoning.


Assignment for the human

  • Approve or reject plan changes rather than individual steps.

  • Adjust constraints or governance rules when evidence supports modification.

  • Oversee exceptions, not execution; act as arbiter of reasoning quality and risk, not implementer.

  • Maintain accountability for thresholds, approvals, and escalation policy.


Capabilities required at Stage 4

  • Meta-reasoning: detect when the current plan is suboptimal or invalid.

  • Self-critique & self-revision while staying inside governance constraints.

  • Delta-justification: explicit, evidence-linked argument for change.

  • Continuous evaluation: real-time metrics, anomaly detection, drift detection.

  • Reversible autonomy: ability to revert or roll back changes deterministically.


Architectures needed

  • Actor–Critic–Editor loops where the system can revise its own output with a justification channel.

  • Verifier-gated modifications — changes must clear constraint and safety checks.

  • Persistent memory of decisions and rejections to avoid cycling.

  • Uncertainty-aware control layer dictating when to proceed vs escalate.

  • Policy layer with dynamic constraints (some constraints modifiable only by human keys).
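
A toy Actor-Critic-Editor loop with a justification channel: the actor drafts, the critic scores and explains, the editor revises, and every revision records why it happened; the loop stops at an acceptance bar or a round limit. The scoring heuristic and drafts are stand-ins for real model calls.

# Actor-Critic-Editor in miniature: every revision carries the critic's justification,
# so no change is silent. Drafting/critique are stubs for real model calls.

def actor(task):
    return f"draft solution for {task}"

def critic(draft):
    score = 0.5 if "revised" not in draft else 0.9     # toy heuristic
    justification = "missing error handling" if score < 0.8 else "meets acceptance bar"
    return score, justification

def editor(draft, justification):
    return f"{draft} (revised to address: {justification})"

def ace_loop(task, threshold=0.8, max_rounds=3):
    draft, revisions = actor(task), []
    for _ in range(max_rounds):
        score, justification = critic(draft)
        if score >= threshold:
            break
        draft = editor(draft, justification)
        revisions.append({"justification": justification, "new_draft": draft})
    return draft, revisions

final, history = ace_loop("retry logic for the payment webhook")
print(final)
print(history)    # the justification channel: why each revision happened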


System of control

  • Human gate on plan revisions instead of micro-gates on actions.

  • Versioned audit of intent → plan → revisions → rationale → actions.

  • Change justification required for every deviation from prior approval.

  • Automatic stop on violation of constraints or low-confidence spikes.

  • Rollback ready for any autonomous delta.


Closest papers / algorithms / architectures enabling Stage 4

  1. Reflexion / Self-Critique frameworks — structured self-revision loops.

  2. Process supervision — supervision on intermediate reasoning, not only outcomes.

  3. Debate + Verifier frameworks — adversarial improvement of plans with adjudication.

  4. Constrained RL / Safe RL — policy improvement under hard constraints.

  5. Tree-of-Thoughts with pruning & replanning — replacing branches mid-search.

  6. Uncertainty-driven abstention (ensembles/MC-dropout) to trigger human oversight.

  7. Actor–Critic–Editor agent stacks used in emerging autonomous research/engineering agents.


Stage 5 — Outcome-Bound Autonomy

Description

The AI is authorized to choose its own strategies and tools to deliver a declared outcome, as long as it stays within explicit guardrails (safety, ethics, budget, policy, SLAs).
Humans no longer pre-approve plans or steps; they define ends and constraints, and adjudicate escalations and post-hoc accountability.
The system adapts online, re-plans, and coordinates sub-agents to meet targets, but must halt or escalate when risk/uncertainty exceeds thresholds.
This is the first stage where autonomy is primarily outcome-driven, not procedure-driven.


Assignment for the AGI

  • Deliver the target outcome (KPIs/SLAs) within budget, timeline, compliance, and safety constraints.

  • Select, sequence, and coordinate tools/agents; redesign approaches as evidence changes.

  • Monitor uncertainty, risk, and constraint adherence continuously; abstain/escalate on violations.

  • Keep a tamper-proof record of plans tried, evidence, actions, and rationale.

  • Provide post-hoc explanations: why chosen, what alternatives were considered, and counterfactuals for misses.


Assignment for the human

  • Specify goals, metrics, constraints, and unacceptable states (red lines).

  • Set authority limits (budgets, scopes, approval ladders) and define escalation thresholds.

  • Review exceptions (breaches, near-misses, high-impact deltas) and adjust policy/guardrails.

  • Own governance quality: clarity of objectives, fairness, and legality—not step-level decisions.

  • Conduct after-action reviews to refine constraints and institutional learning.


Capabilities required at Stage 5

  • Goal-conditioned planning & re-planning with multi-objective optimization (cost, risk, fairness, quality).

  • Constraint-aware control (hard/soft constraints, CMDP reasoning) with real-time violation detection.

  • Uncertainty-aware decision making with calibrated confidence and abstention policies.

  • Multi-agent orchestration (division of labor, scheduling, conflict resolution, shared memory).

  • Persistent provenance & accountability (who/what/why logs; counterfactual analysis).

  • Impact-aware execution (canaries, rollbacks, blast-radius limits).


Architectures needed

  • Meta-controller over planner/executor agents that optimizes outcomes under policy/constraint layers (constitutional rules, allow-lists, caps).

  • Constrained planning stack (e.g., search/MPC with barrier functions or Lagrangian relaxations) integrated with tool APIs.

  • Risk & uncertainty services (ensembles, change-point detection, OOD, tail-risk estimators) gating actions.

  • Rightsized memory: shared episodic/semantic stores for goals, contracts, runbooks, and prior incidents.

  • Governance bus: immutable event ledger, policy checks, duty-of-care verifiers, and audit hooks on the execution path.

  • Escalation engine that routes to humans based on risk × reversibility × novelty.
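
One way the escalation engine above might be scored: combine risk, irreversibility, and novelty and route to a human when the product crosses a threshold. The weighting and threshold are assumptions to be calibrated per domain.

# Escalation router: risk x irreversibility x novelty decides whether an action
# proceeds autonomously or is routed to a human for sign-off.

ESCALATION_THRESHOLD = 0.2   # illustrative; tuned per domain and appetite for risk

def escalation_score(risk: float, reversibility: float, novelty: float) -> float:
    # All inputs in [0, 1]; irreversible actions (reversibility = 0) score highest.
    return risk * (1.0 - reversibility) * novelty

def route(action: str, risk: float, reversibility: float, novelty: float) -> str:
    score = escalation_score(risk, reversibility, novelty)
    if score >= ESCALATION_THRESHOLD:
        return f"ESCALATE '{action}' (score={score:.2f})"
    return f"PROCEED '{action}' (score={score:.2f})"

print(route("retry failed batch job",     risk=0.2, reversibility=0.9, novelty=0.1))
print(route("issue refunds to 40k users", risk=0.7, reversibility=0.1, novelty=0.8))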


System of control

  • Ends-over-means contract: authority is tied to outcomes and revocable upon breach or low confidence.

  • Capability gates: budget caps, scope whitelists, rate limits, and dual-key approval for high-impact actions.

  • Shadow→canary→generalize rollout: new strategies must pass staged exposure with auto-rollback.

  • Live compliance monitors: policy classifiers, DLP, safety shields, and fairness checks run pre- and post-action.

  • Red-team-in-prod: continuous adversarial probes to test jailbreaks, prompt/command injection, and tool misuse.

  • Accountability artifacts: decision dossiers (goal, options, chosen plan, evidence, risks, mitigations, outcomes) for every major action.


Closest papers / algorithms / architectures enabling Stage 5

  1. Constrained MDPs / Safe RL (e.g., Lagrangian methods, CPO) — optimize reward subject to explicit cost/safety budgets; natural fit for outcome-with-guardrails control.

  2. Model Predictive Control (MPC) with safety shields / control barrier functions — plan over a horizon while enforcing hard constraints at runtime; practical for continuous re-planning.

  3. Multi-objective / Pareto optimization for agents — formalize trade-offs among cost, quality, risk, fairness; select operating points via policy.

  4. Uncertainty stacks (deep ensembles, change-point/OOD detectors) — calibrate risk, trigger abstention/escalation, and adjust exploration vs exploitation.

  5. Debate/Verifier + Process-Supervision — strengthen plan quality and provide reviewable intermediate reasoning for accountability.

  6. ReAct/Toolformer-style tool ecosystems with policy guards — autonomous tool orchestration under constitutional rules and allow-lists.

  7. Tree-of-Thoughts / Replanning search — swap strategies mid-trajectory with justification and pruning, aligned to outcome metrics.


Stage 6 — Institutional Governor, Not Operator

Description

Humans no longer supervise how the AI works or which plan it executes. They author the governance layer itself — the rules, constraints, escalation policies, accountability formats, and legitimacy conditions under which autonomous agents operate.
Day-to-day work is done by AI systems; human effort concentrates on oversight design, adjudication of disputes, and revision of constitutions, not on production activities.
The locus of human power migrates from execution and planning to policy-level control over what is allowed, by whom, under what guarantees, and with what transparency mechanisms.


Assignment for the AGI

  • Operate continuously within existing constitutions, constraints, and audit protocols without needing stepwise approval.

  • Escalate only when governance rules demand escalation (risk threshold, ethics trigger, conflict of interest, uncertainty failure).

  • Record actionable, legible accountability artifacts for all significant decisions or impacts.

  • Obey policies even when they degrade efficiency; compliance outranks performance.


Assignment for the human

  • Define and update rules of operation (constitutions, guardrails, forbidden regions, auditing duties, proof obligations).

  • Decide exceptions, appeals, and conflicts when the AI surfaces an escalation or normative ambiguity.

  • Evaluate not outputs but governance adequacy — refining incentives, constraints, and oversight structure.

  • Ensure institutional legitimacy: compliance, traceability, fairness, and public defensibility.


Capabilities required at Stage 6

  • Policy-conditioned agency — agent must internalize rules as hard boundaries, not recommendations.

  • Self-auditing / self-reporting — agents must pre-emptively document evidence, risks, and divergences.

  • Normative alignment to constitutions — obey high-level rules without per-instance instruction.

  • Conflict detection & escalation logic — recognize when policy-level judgment is required.

  • Stable operation under imperfect rules — don’t “optimize around” governance gaps.


Architectures needed

  • Constitutional layer at inference time — not just at training; rules must bind execution.

  • Multi-layer verifiers — factual, safety, legal, ethical, compliance as parallel gating stacks.

  • Immutable audit substrate — tamper-proof logs of reasoning, evidence, and decisions with replayability.

  • Escalation switchboard — routes disputes to human governors based on policy conditions.

  • Separation of powers — planner, executor, and verifier roles cannot collude; enforce architectural checks.
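
A small sketch of the multi-layer verifier idea: independent safety, legal, and compliance checks run against a proposed action, execution requires unanimous approval, and any objection escalates with reasons attached. The checks themselves are invented placeholders for real policy models.

# Parallel verifiers gating execution: an action runs only if every independent
# verifier approves; otherwise it escalates with the collected objections.

def safety_verifier(action):
    return ("irreversible" not in action, "irreversible action needs human co-sign")

def legal_verifier(action):
    return ("personal data" not in action, "personal data handling requires DPO review")

def compliance_verifier(action):
    return ("production" not in action, "production changes need a change-control ticket")

VERIFIERS = [safety_verifier, legal_verifier, compliance_verifier]

def gate(action: str):
    objections = [reason for ok, reason in (v(action) for v in VERIFIERS) if not ok]
    if objections:
        return f"ESCALATE '{action}': " + "; ".join(objections)
    return f"EXECUTE '{action}'"

print(gate("archive stale internal reports"))
print(gate("export personal data to production analytics"))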


System of control

  • Governance-over-action: humans regulate the rules, not the run-time details.

  • Tiered authority — high-impact classes require multi-human or institutional approval.

  • Legibility requirement — no opaque decisions are accepted as legitimate.

  • Norm-binding — systems must degrade to abstention rather than act in policy-uncertain zones.

  • Periodic constitutional review — governance itself is audited and improved, not assumed correct.


Closest papers / algorithms / architectures enabling Stage 6

  1. Constitutional AI — explicit rule-sets steering behavior during inference, not just during training.

  2. Debate + Adjudication frameworks — structure by which competing rationales surface for human governors to resolve.

  3. Process Supervision & Verifier Models — reason-trace inspection and policy conformity, not just outcome correctness.

  4. Audit-grade provenance systems — RETRO/RAG with cryptographic logging and citation enforcement.

  5. Safe RL with hard constraints — policy-bounded autonomy with mandated abstention on rule conflict.

  6. Governance-first architectures — role-segregated agent stacks (planner/actor/verifier/safety arbitrator).

  7. Escalation logic & uncertainty gating — decision to hand control back to humans is part of the policy itself.


Stage 7 — Wish-Level Intent Specification

Description

Humans no longer specify plans, constraints, or procedures directly. They express intent at the level of ends (“make this true in the world”) and the system autonomously determines and governs the means under already-established constitutional rules.
The AI stack becomes a goal-realization engine inside a policy box: the human states direction; the system handles design, planning, execution, correction, and compliance.
Human agency moves fully to meta-sovereignty: defining what should count as success, acceptability, safety, and legitimacy — not how to reach it.


Assignment for the AGI

  • Interpret high-level intent into structured goals without human breakdown.

  • Generate, select, and revise strategies automatically under governance constraints.

  • Detect when intent collides with constitutional rules and request human clarification.

  • Self-monitor and self-correct without waiting for supervision.

  • Deliver the achieved state plus explanatory dossier and counterfactual justification.


Assignment for the human

  • Express ends, not means — the “what” and the “why”, not the “how”.

  • Maintain and evolve constitutional boundaries (ethics, safety, legality, fairness).

  • Arbitrate only those cases where intent conflicts with norms or where the system abstains.

  • Validate outcomes, not intermediate choices.

  • Provide meta-oversight of the alignment framework, not the execution.


Capabilities required at Stage 7

  • Goal inference from underspecified natural-language intent without distorting what the user meant.

  • Fully autonomous search/plan/execute/reflect loops inside constraint envelopes.

  • Norm-preserving optimization — outcomes must satisfy constitutions even if cheaper violations exist.

  • Abstention on normative ambiguity — when unsure of the user’s implied social contract, stop.

  • Global accountability — produce legible, audit-grade rationales for the entire causal chain.


Architectures needed

  • Intent-to-goal translators with uncertainty flags (semantic → operational goal mapping).

  • Unified planning/execution stack with built-in reflectivity and constraint shields.

  • Constitutional filters at every stage (interpretation, planning, action, revision, evaluation).

  • Persistent normative memory linking past rulings/precedents to new intents.

  • Holistic audit substrate that binds intent, means, and outcomes cryptographically.
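
A toy version of the intent-to-goal translator with uncertainty margins: candidate operational goals carry confidences, and the system commits only when the best interpretation is both confident and clearly separated from the runner-up; otherwise it abstains and asks. Candidate generation is a stub for a real inference model, and the thresholds are illustrative.

# Intent-to-goal translation with an uncertainty margin: commit only when the best
# interpretation is confident AND clearly better than the alternatives; else abstain.

def candidate_goals(intent: str):
    # Stand-in for goal inference / preference modeling over the stated wish.
    return [
        ("reduce churn by improving onboarding",  0.55),
        ("reduce churn by discounting renewals",  0.40),
        ("increase signups via paid acquisition", 0.05),
    ]

def translate(intent: str, min_confidence=0.7, min_margin=0.2):
    ranked = sorted(candidate_goals(intent), key=lambda g: g[1], reverse=True)
    (best, p1), (_, p2) = ranked[0], ranked[1]
    if p1 < min_confidence or (p1 - p2) < min_margin:
        return "ABSTAIN: intent is ambiguous; asking the human to clarify"
    return f"adopt goal: {best}"

print(translate("make our users stay with us"))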


System of control

  • Human sovereignty at the level of norms and ends, not operations.

  • AI autonomy inside those norms — means are delegated unless constitutionally blocked.

  • Escalation only on constitutional conflict or unresolved ambiguity.

  • Outcome-based accountability with after-action reviews feeding back to constitutional updates.

  • Stability of governance more important than speed of execution.


Closest papers / algorithms / architectures enabling Stage 7

  1. Constitutional AI (inference-time governance) — rules binding not training-time only.

  2. Debate + Verifier + Adjudication loops — normative conflict surfacing and resolution.

  3. Constrained / Safe RL for goal-directed autonomy — outcomes under legal/ethical bounds.

  4. Process-supervision & reason-trace auditing — proofs of compliant reasoning, not just compliant outputs.

  5. Intent alignment & goal translation work (goal-inference, preference learning, inverse RL) — mapping wishes into safe goals.

  6. Persistent normative memory & precedent systems — reuse of past rulings to disambiguate new intents.

  7. Full agentic stacks with policy-gated autonomy — planning + execution + correction + logging without human micromanagement.