
October 27, 2025

Artificial general intelligence will not erupt into every industry at once; it will advance through domains in the order reality permits. The decisive factors are not ideological but mechanical: the ease of feedback, the reversibility of error, the density of regulation, and the cost of being wrong. This is why code and text will fall first, and medicine and machines will fall last.
What makes this transition hard is that most critical work in the world is not a single act of prediction but a closed loop of perception, interpretation, decision, and consequence. AGI cannot simply “answer questions”; it must act in the world and remain correct after the world moves. This requires six architectural ingredients to coexist: world-models, planning, self-improvement, layered memory, tool-use, and built-in safety. Missing any one of them collapses reliability at scale.
For early domains like software and research, the loop is cheap and reversible. Code can be rolled back; literature can be re-read; failures are not existential. These domains already show high readiness because symbolic tasks, retrievable evidence, and machine-checkable feedback create a dense learning signal. What remains is mostly engineering: specification extraction, provenance, sandboxing, and governance.
Mid-tier domains like marketing, tutoring, compliance, and climate/energy planning are more brittle. They blend symbolic reasoning with human norms, regulation, or high-stakes interventions. They are ready for co-pilot regimes but not for unbounded autonomy. They will scale only when guardrails (review ladders, constitutions, abstention logic, audit trails) are made structural rather than advisory.
Autonomy in science and industry brings a harder barrier: physical irreversibility. In-silico science is relatively mature; AlphaFold, RFdiffusion, FNO-based emulators, and self-driving-lab (SDL) planners have already shifted the frontier. But the step from simulation to actuation (SDLs, robotized plants, logistics control) requires safety envelopes, anomaly detection, and liability frameworks that must mature before autonomy is allowed to execute.
Healthcare is last because it is the only domain where the value of caution exceeds the value of speed. The bar is not statistical superiority but ethical, legal, and institutional legitimacy under uncertainty and tail risk. This imposes requirements no other domain must meet: causal accountability over long horizons, escalations on uncertainty, documented rationales, and regulator-grade evidence chains.
Across all ten domains the necessary preconditions are converging: explicit uncertainty estimation, abstention pathways, multi-agent critique, provenance logging, and human-in-the-loop control where harm is not recoverable. The frontier is less about more parameters and more about closing the loop: linking model cognition to tools, actions, memory, and verifiers so that decisions are both competent and governed.
Progress to deployment now depends more on institutional change than model weights. Organizations must rewrite procedures, incentives, and accountability so that agents can execute without eroding trust. AGI will not merely replace people; it will force the redesign of the surrounding institutions that currently assume humans are in the loop. Adoption is the hard part, not inference.
Software engineering. Why early: symbolic, testable, decomposable, machine-verifiable; high ROI and low regulatory drag.
Hard bits: missing specs, non-local dependencies, secure tool execution.
Bottlenecks: spec-from-tickets, repo-wide code graphs, hermetic sandboxes, formal checks.
Adoption reality: agent-in-the-loop PRs → merge-on-green for low-risk classes; security and provenance mandatory.
Research and knowledge work. Why early: literature, policy, market, DD work is retrieval-reason-critique; symbolic feedback easy.
Hard bits: truth under uncertainty, provenance, multimodal extraction, bias and agenda.
Bottlenecks: evidence OS, claim–evidence graphs, update/refresh pipelines, argument scaffolds.
Adoption reality: define trusted corpora, review ladders, immutable logs, template-governed outputs.
In-silico scientific discovery. Why early: AF2/RFdiffusion/FNO show design & PDE surrogates are tractable.
Hard bits: surrogate overconfidence, multi-constraint scoring, novelty vs validity.
Bottlenecks: uncertainty-aware scoring, composite constraints, novelty benchmarks.
Adoption reality: governed loops, provenance, scientist-as-arbiter not hand-operator.
Self-driving laboratories. Why next: robotic execution closes the loop from design→experiment→update.
Hard bits: biosafety, expensive feedback, real-world drifts, multi-objective control.
Bottlenecks: experiment-planners under safety budgets, machine-readable protocols, anomaly aborts.
Adoption reality: tiered approval, replication before claims, reskilling lab staff, compliance embedding.
Marketing and growth. Why middle-early: symbolic, measurable, decomposable tasks; A/B feedback.
Hard bits: persuasion ethics, attribution, messy CRM data, multi-objective tradeoffs.
Bottlenecks: CRM/AB integration, regulatory guardrails, causal evaluation.
Adoption reality: human approval of outbound, brand constitutions, instrumented funnels.
Education and tutoring. Why middle-early: RCTs show gains; tutoring fits adaptive explain-question-remediate loops.
Hard bits: pedagogy ≠ correctness, diagnosing misconceptions, affect & safety with minors.
Bottlenecks: learner-models, pedagogy-aware generation, standards alignment, mastery verification.
Adoption reality: teacher-in-loop, credential alignment, privacy/governance acceptance.
Legal and compliance. Why middle: rule-dense, document-dense; retrieval-reason-map fits well.
Hard bits: liability, dynamic laws, semantics in prose, combinatorial risk.
Bottlenecks: norm parsing, change-propagation, evidence-to-control linking, abstention rules.
Adoption reality: risk tiers & sign-off ladders, audit trails, re-role lawyers as reviewers.
Climate, energy, and logistics. Why middle-late: emulators beat baselines; decisions high-impact.
Hard bits: tail-risk uncertainty, regime shifts, multi-objective plans, accountability of actions.
Bottlenecks: uncertainty comms, forecast→optimization coupling, fail-safes, regulatory fit.
Adoption reality: copilot first, shadow mode, dual-control, regulatory updating.
Robotics and industrial automation. Why late: physical irreversibility, safety, liability, sim-to-real gap.
Hard bits: non-stationary reality, multi-robot coordination, human co-presence.
Bottlenecks: uncertainty-aware control, runtime monitors, task grounding, lifecycle governance.
Adoption reality: bounded cells, human authorizers, reskilling, EHS & insurance integration.
Healthcare. Why last: maximal stakes, ethical/legal drag, fragmented systems.
Hard bits: weak labels, long-horizon harm, ethical constraints, integration.
Bottlenecks: abstention/uncertainty, causal eval, normative alignment, regulatory pathways.
Adoption reality: co-pilot only, logged rationales, clinician oversight, institutional legitimacy required.
Software is natively symbolic and machine-checkable: compilation, static analysis, tests, and benchmarks provide cheap, high-frequency feedback signals.
The workflow decomposes well: tickets, sub-tasks, code blocks, and review gates can be orchestrated by hierarchical or multi-agent patterns.
The ecosystem already exposes tools (linters, CI/CD, container builds, package managers, coverage, fuzzers) that AGI can call as cognitive tools.
Specifications are often implicit, ambiguous, or missing; the agent must infer the intent from partial artifacts and context.
Non-local reasoning is required: many bugs emerge only when changes interact with concurrency, security, or cross-service dependencies.
Long-horizon work such as multi-repo refactors or staged migrations requires stable memory, planning, and rollback safety.
Tool execution is itself a security surface (prompt injection, secret exfiltration, malicious dependencies).
Readiness is high for bounded autonomy in drafting code, tests, documentation, and localized refactors under human review.
Readiness is moderate for agentic orchestration across entire repositories when tests and CI guardrails are strong.
Readiness is low for unsupervised large-scale or safety-critical changes where failure cost is high and specification is incomplete.
We need robust mechanisms for converting informal tickets, logs, traces, and architecture notes into executable acceptance tests (sketched below).
We need persistent, queryable representations of large codebases (AST + call graph + ownership + runtime profiles) for agent reasoning.
We need hermetic, reproducible sandboxes so agents can test safely with no side-effects.
We need strong integration of formal methods (contracts, model checking, fuzzing) into the agent’s main loop, not as afterthoughts.
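To make the spec-from-ticket requirement above concrete, here is a minimal sketch of the target artifact, assuming some upstream model step fills in the fields: a structured spec that renders directly into a pytest test, so the ticket's intent is enforced by CI rather than described in prose. `AcceptanceSpec`, `extract_spec`, and the example fields are illustrative placeholders, not an existing tool.

```python
from dataclasses import dataclass

@dataclass
class AcceptanceSpec:
    summary: str   # one-line restatement of the ticket's intent
    arrange: str   # executable setup, e.g. "cart = Cart()"
    act: str       # expression under test, e.g. "cart.total()"
    expect: str    # boolean check over `result`, e.g. "result == 0"

def extract_spec(ticket_text: str) -> AcceptanceSpec:
    """Placeholder for the model step that reads the ticket, logs, and traces."""
    raise NotImplementedError

def render_pytest(spec: AcceptanceSpec, name: str) -> str:
    """Emit a pytest test so the spec becomes machine-checkable."""
    return (
        f"def test_{name}():\n"
        f'    """{spec.summary}"""\n'
        f"    {spec.arrange}\n"
        f"    result = {spec.act}\n"
        f"    assert {spec.expect}\n"
    )

# Example with hand-filled fields (what extract_spec would produce for a simple ticket):
spec = AcceptanceSpec("empty cart totals to zero", "cart = Cart()", "cart.total()", "result == 0")
print(render_pytest(spec, "empty_cart_total"))
```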
Adoption must start with agent-in-the-loop PRs and graduate to merge-on-green only where tests and policies enforce safety.
Accountability must be explicit: code-owners, approval gates, and rollback plans must stay intact with agent contributors.
Incentives must reward writing testable specifications and high-signal feedback (not just “doing it manually”).
Security posture must assume the agent is an untrusted actor: run least-privilege, enforce SBOM/allow-lists, and compartmentalize credentials.
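A least-privilege posture can be made structural rather than advisory. The sketch below shows one possible shape, an allow-listed tool runner with a throwaway working directory and a stripped environment; the allow-list contents are illustrative, and a real deployment would add container- or seccomp-level isolation on top.

```python
import shlex
import subprocess
import tempfile

ALLOWED_TOOLS = {"pytest", "ruff", "mypy"}  # illustrative allow-list

def run_tool(command: str, timeout: int = 120) -> subprocess.CompletedProcess:
    """Run a single allow-listed tool with no inherited secrets and no persistent state."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not on allow-list: {argv[:1]}")
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            argv,
            cwd=scratch,                      # throwaway working directory
            env={"PATH": "/usr/bin:/bin"},    # minimal env: no tokens or credentials leak
            capture_output=True,
            text=True,
            timeout=timeout,
        )
```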
Most deliverables are textual or analytical: briefs, literature reviews, market scans, diligence reports, and policy memos map cleanly to RAG + verifier loops.
Evidence, tables, and citations are machine-retrievable; critique and self-check agents can loop over claims to refine reliability.
The tasks are decomposable: searching, clustering, summarizing, drafting, and reviewing can be orchestrated in stages.
Truth cannot always be checked directly; in contested or sparse domains the model must represent epistemic uncertainty explicitly.
Provenance is fragile: claims must remain stably linked to sources, even when pages change or access is restricted.
Multimodal synthesis across PDFs, tables, plots, and code is noisy and brittle in extraction and alignment.
Agenda, framing, and confirmation bias can distort outputs unless systematically counter-argued or adversarially reviewed.
Readiness is high for first-drafting briefs, executive summaries, structured reports, and literature maps when retrieval is coupled with citation checking.
Readiness is moderate for diligence and analytic tasks when spreadsheet modeling, validators, and domain templates constrain the output space.
Readiness is low for high-stakes synthesis in domains with weak ground truth or political/ethical stakes without multi-expert review.
We need durable “evidence OS” pipelines: ingestion, deduplication, OCR, table extraction, citation-graphing, and immutable hashing.
We need claim–evidence graphs that map every statement to its support and to counter-evidence, annotated with uncertainty (sketched below).
We need scheduled refresh and change-detection so knowledge products do not silently decay.
We need argumentation scaffolds: side-by-side steelman vs strawman comparisons and adversarial critiques by parallel agents.
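As a rough illustration of the claim–evidence graph idea above, the sketch below stores each claim with a calibrated confidence and content-hashed evidence excerpts, so that a later change in the source shows up as a digest mismatch. The class and field names are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source_url: str
    excerpt: str
    stance: str     # "supports" or "contradicts"
    digest: str = ""

    def __post_init__(self):
        # Hash the excerpt so silent edits to the source page can be detected later.
        self.digest = hashlib.sha256(self.excerpt.encode("utf-8")).hexdigest()

@dataclass
class Claim:
    text: str
    confidence: float                        # calibrated probability, not model enthusiasm
    evidence: list[Evidence] = field(default_factory=list)

    def unsupported(self) -> bool:
        return not any(e.stance == "supports" for e in self.evidence)
```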
Organizations must define authoritative corpora, citation policies, and exclusion lists (e.g. no-trust sources).
Review protocols must be explicit: who signs off, on what criteria, at what risk tier.
Templates and standards must be enforced so outputs become interchangeable and auditable, not stylistic.
All agentic research must be logged with immutable provenance so responsibility, compliance, and IP chains are preserved.
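One lightweight way to get immutable provenance is a hash-chained, append-only log: each record commits to the previous record's digest, so any silent edit to the research trail breaks the chain. The sketch below assumes records are JSON-serializable; durable storage and access control are out of scope here.

```python
import hashlib
import json
import time

class ProvenanceLog:
    """Append-only log where each entry commits to the digest of the previous entry."""

    def __init__(self):
        self.records = []
        self._prev_digest = "0" * 64   # genesis value

    def append(self, event: dict) -> str:
        record = {"ts": time.time(), "event": event, "prev": self._prev_digest}
        digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.records.append((digest, record))
        self._prev_digest = digest
        return digest

log = ProvenanceLog()
log.append({"step": "retrieved", "doc": "report.pdf"})
log.append({"step": "claim_added", "claim_id": 17})
```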
Scientific workflows are increasingly symbolic and computational first: protein structure, molecular docking, climate and material simulations live entirely in code and math.
Generative and surrogate models reduce the search space before touching a pipette, making R&D an information discipline first and a wet discipline second.
Feedback loops are available via simulation scores, binding affinity predictions, energy minima, PDE surrogates, or literature evidence, which allow tight iteration without physical cost.
Ground truth scarcity: many scientific hypotheses have no immediate empirical labels, making supervision and calibration difficult.
Surrogate overconfidence: surrogate models can be confidently wrong and bias downstream search if not uncertainty-aware.
Hidden constraints: domain-specific constraints (thermo-stability, toxicity, manufacturability) are often absent from naïve objective functions.
Novelty vs validity tension: maximizing novelty pushes models off the data manifold; maximizing validity collapses to known basins.
High for protein structure and design tasks due to AlphaFold-class predictors and RFdiffusion-class generators.
Moderate for PDE-governed domains due to FNO/GraphCast/FourCastNet-style emulators showing production-relevant fidelity.
Low for truly autonomous theory-formation with correctness guarantees; high-level conceptual synthesis still requires expert interrogation.
Uncertainty-aware scoring loops that penalize overconfident surrogates and seek information gain, not just objective maximization (sketched below).
Composite objective functions that integrate manufacturability, toxicity, ethical constraints, and real-world feasibility into the optimization loop.
Benchmarking for genuine novelty and transfer, not merely re-derivation of known solutions.
Transparent claim–evidence graphs that trace all model suggestions to supporting physics, literature, or empirical priors.
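The uncertainty-aware scoring loop above can start as simply as an upper-confidence-bound style acquisition with a feasibility gate, sketched below; the surrogate interface (a per-candidate mean and standard deviation), the `kappa` exploration weight, and the feasibility floor are assumptions, not a prescription.

```python
import numpy as np

def acquisition(mean, std, feasibility, kappa=1.0, floor=0.5):
    """Score candidates so that good *or* informative designs rank high,
    while clearly infeasible ones (toxicity, manufacturability, ...) never surface."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    feasibility = np.asarray(feasibility, dtype=float)
    score = mean + kappa * std             # reward predicted value and surrogate uncertainty
    score[feasibility < floor] = -np.inf   # hard gate on composite constraints
    return score

ranking = np.argsort(-acquisition(mean=[0.80, 0.55], std=[0.05, 0.30], feasibility=[0.9, 0.8]))
```

The `kappa * std` term is what keeps an overconfident surrogate from monopolizing the search: candidates it knows least about stay competitive instead of being silently discarded.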
Regulatory alignment: use agentic R&D under controlled internal review committees before exposing outputs to external pipelines.
Provenance & auditability: all hypotheses, scores, priors, and intermediate reasoning must be logged for reproducibility and IP claims.
Role redefinition: scientists must shift from “manual operators” to “hypothesis arbiters” who approve and challenge machine-generated proposals.
Incentive redesign: reward labs for validating or falsifying AI-generated hypotheses, not just human-conceived ones.
Once designs are candidate-screened in-silico, robotic wet labs can execute, measure, and loop results back to models, forming a closed, autonomous discovery cycle.
Robotic execution eliminates human latency, allows continuous optimization, and produces standardized, structured data that can be re-fed to learners.
SDLs convert science from episodic manual runs to industrial continuous processes.
Safety & containment: chemical and biological procedures have non-recoverable failure modes and regulatory controls; robots must obey safety envelopes.
Real-world variance: instruments drift, reagents degrade, sensors misread — reality introduces unmodeled noise not present in simulation.
Sparse and expensive feedback: each wet experiment can consume time, money, and scarce materials; exploration must be sample-frugal.
Multi-constraint control: objectives span yield, purity, kinetics, stability, cost, and biosafety simultaneously.
High for narrow optimization loops in chemistry/materials where protocols are stable and objectives are well-defined.
Moderate for bio/therapeutics where safety envelopes and regulatory reporting add delay and friction.
Low for open-ended “generalist” wet autonomy that spans many domains without human curators.
Reliable experiment-planning agents that choose what to run next under explicit safety and cost budgets (sketched below).
Standardized machine-readable protocols (PPL-equivalents for wet work) so agents can compose and modify procedures deterministically.
Real-time anomaly detection and automatic abort/recovery logic to prevent runaway failures.
Bi-directional data normalization so wet outputs return as structured, model-ingestible information without manual curation.
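A minimal sketch of a budgeted experiment selector, plus a crude abort check, assuming each candidate protocol carries an expected-information estimate, a cost, and a hazard class; in practice these inputs would come from upstream models and from safety review rather than being hand-assigned.

```python
from dataclasses import dataclass

@dataclass
class CandidateExperiment:
    protocol_id: str
    expected_info: float   # e.g. expected reduction in model uncertainty
    cost: float            # reagents plus instrument time, in a common unit
    hazard_class: int      # 0 = benign ... 3 = requires human approval

def plan_batch(candidates, budget, max_autonomous_hazard=1):
    """Greedy selection by information per unit cost, within budget and safety limits."""
    chosen, spent = [], 0.0
    for c in sorted(candidates, key=lambda c: c.expected_info / c.cost, reverse=True):
        if c.hazard_class > max_autonomous_hazard:
            continue   # escalate to a human instead of running autonomously
        if spent + c.cost <= budget:
            chosen.append(c)
            spent += c.cost
    return chosen

def should_abort(reading, expected, tolerance):
    """Abort the run when a sensor reading drifts outside the modeled envelope."""
    return abs(reading - expected) > tolerance
```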
Governance must define which classes of experiments may run autonomously vs require human approval or dual-control.
Validation infrastructure must exist for independent replication of AI-proposed hits before claiming results or filing IP.
Workforce must reskill from pipetting to supervising, diagnosing, and improving autonomous experiment pipelines.
Legal & compliance units must extend SOPs, insurance, audit, and incident-reporting to autonomous agents, not only humans.
Most outputs are symbolic (copy, decks, outreach, segmentation, strategy memos), which map cleanly to agentic RAG + critique workflows.
The work decomposes well: research → segmentation → message crafting → A/B plan → iteration based on metrics.
Many feedback signals (CTR, reply rate, conversion, sentiment) are measurable and can drive continual optimization.
Objectives are multi-dimensional and noisy (brand equity, trust, persuasion vs compliance vs speed).
Persuasion tasks risk misalignment with ethics, law, and reputation; strong safety and policy layers are required.
Data quality is uneven: CRM data, campaign logs, and customer segments are often messy, sparse, and siloed.
Attribution is non-trivial: multiple simultaneous channels obscure causal effects.
Readiness is high for content generation, copy variation, ideation, campaign concepts, and narrative frameworks under human review.
Readiness is moderate for analytical tasks such as persona extraction, funnel diagnostics, and opportunity sizing when instrumented with data access.
Readiness is low for fully autonomous campaign execution with budget authority; risk, compliance, and brand liability require gated oversight.
Clean integration with CRM, analytics, AB testing, and attribution layers so agents learn from real feedback, not static prompts.
Guardrails for regulatory, reputational, and ethical constraints (claims compliance, disclosure, fairness, political constraints).
Stable evaluation surfaces: standardized KPIs and uplift tests per channel to avoid optimizing the wrong surrogate.
Automated causal inference hooks (uplift modeling / counterfactuals), not just correlational dashboards.
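As one simple instance of such a hook, the sketch below estimates per-customer uplift with a two-model (T-learner) approach over a logged campaign, assuming numpy arrays of features, a randomized `treated` flag, and a binary `converted` outcome; the learner choice and the randomization assumption are illustrative, and without a randomized or otherwise de-confounded holdout the estimate is not causal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uplift_scores(X, treated, converted):
    """Estimated lift in conversion probability from contacting each customer."""
    model_treated = LogisticRegression(max_iter=1000).fit(X[treated == 1], converted[treated == 1])
    model_control = LogisticRegression(max_iter=1000).fit(X[treated == 0], converted[treated == 0])
    return model_treated.predict_proba(X)[:, 1] - model_control.predict_proba(X)[:, 1]
```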
Redefine roles so human marketers supervise, constrain, and interpret agent proposals rather than manually producing all assets.
Require human approval for outbound actions and budgets; log and audit all generated messaging.
Train teams to instrument campaigns so learning signals exist (without metrics, the agent cannot improve).
Establish brand policies and tone rules as machine-readable constitutions used by agents at generation time.
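A brand or tone constitution only constrains generation if it is machine-readable and checked before anything leaves the building. The sketch below shows one possible encoding; the rules themselves are invented placeholders, not anyone's actual policy.

```python
BRAND_CONSTITUTION = {
    "banned_phrases": ["guaranteed results", "risk-free"],
    "required_disclosure": "terms apply",
    "max_exclamation_marks": 1,
}

def violations(copy_text: str, rules: dict = BRAND_CONSTITUTION) -> list[str]:
    """Return the list of rule violations; any violation blocks autonomous sending."""
    lowered = copy_text.lower()
    found = [f"banned phrase: {p}" for p in rules["banned_phrases"] if p in lowered]
    if rules["required_disclosure"] not in lowered:
        found.append("missing required disclosure")
    if copy_text.count("!") > rules["max_exclamation_marks"]:
        found.append("too many exclamation marks")
    return found
```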
Personalized tutoring maps well to LLMs’ ability to explain, question, assess, and adapt in dialogue.
Curriculum decomposition allows hierarchical teaching plans (concept → example → check → remediation → spiral return).
RCTs already show AI tutors can outperform standard classroom methods on learning gain per unit of time.
Pedagogical correctness is not identical to textual correctness; an answer that is “right” may not be instructionally effective.
Student modeling is partial and noisy; inferring misconceptions from short dialogues is non-trivial.
Motivation and affect matter; tutoring requires emotional regulation, not just information delivery.
Safety and ethics are acute with minors: data governance, harmful content, and manipulation risks.
Readiness is high for explanation, drilling, quiz generation, and structured tutoring in constrained domains (math, languages, STEM basics).
Readiness is moderate for personalized remediation and pacing if diagnostics are integrated.
Readiness is low for full curricular autonomy, grading with legal consequences, and high-stakes certification without human intervention.
Rich learner modeling that tracks misconceptions, effort, retention, and engagement longitudinally, not just correctness (sketched below).
Pedagogy-aware generation: agents must choose how to teach, not only what to answer.
Alignment with standards and curriculum so agent tutoring is recognized institutionally.
Verifiable evaluation loops: human or automated mastery checks must close the loop.
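For the longitudinal learner-model requirement above, a classical starting point is Bayesian knowledge tracing: a per-skill mastery probability updated after every response. The sketch below uses placeholder slip, guess, and learn-rate parameters; a real system would fit them per skill and track many skills per learner.

```python
def bkt_update(p_mastery, correct, slip=0.10, guess=0.20, learn=0.15):
    """One Bayesian knowledge tracing step: posterior given the response, then learning."""
    if correct:
        post = p_mastery * (1 - slip) / (p_mastery * (1 - slip) + (1 - p_mastery) * guess)
    else:
        post = p_mastery * slip / (p_mastery * slip + (1 - p_mastery) * (1 - guess))
    return post + (1 - post) * learn   # chance the skill was acquired on this step

p = 0.3                                 # prior mastery for one skill
for answer_correct in [True, False, True, True]:
    p = bkt_update(p, answer_correct)   # p now reflects the whole response history
```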
Schools must define when AI tutors may act autonomously and when human teachers certify learning.
Teacher role must shift from “lecturer” to “diagnostician and coach” supervising agent-driven practice.
Parents and regulators must accept privacy, safety, and fairness controls before scale deployment.
Institutions must anchor credentialing and assessment workflows so AI tutoring is not pedagogically invisible or academically illegitimate.
The deliverables are mostly textual, analytical, and rule-constrained (contracts, policies, compliance reports, risk memos, board packs, audits).
Work decomposes hierarchically: ingest → interpret rule/standard → map to entity/process → generate obligations → monitor → report.
Retrieval + structured extraction + reasoning + verification allows machine construction of obligations and controls from laws, contracts and standards.
Precision errors are intolerable: a single wrong clause or misinterpreted obligation creates legal or financial liability.
Knowledge is dynamic: laws, regulations, and internal policies change and cascade into dependencies.
Many constraints have no machine-readable form; semantics live in prose, case law, negotiation history, or regulator intent.
Risk is combinatorial: compliance sits at intersections of jurisdictions, domains, and actors.
Readiness is high for assistive drafting, redlining, policy synthesis, mapping of obligations, and first-pass due-diligence with human oversight.
Readiness is moderate for semi-autonomous monitoring and exception triage when paired with retrieval, rule-engines, and human gates.
Readiness is low for fully autonomous issuance of binding decisions or filings without sign-off.
Trustworthy parsing of norms into structured representations (obligations, prohibited acts, time-bounds, evidence requirements), as sketched below.
Continuous change-detection linking new laws or rulings to affected obligations and controls.
Integrated verification pipelines (compliance evidence → cross-check → audit trail) that are machine-consumable.
Calibration and escalation logic: when the agent should abstain and trigger a human.
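A rough sketch of how structured obligations and the abstention rule above could fit together: each extracted obligation carries its source, actor, time bound, required evidence, and a calibrated confidence in the extraction itself, and anything below threshold is routed to counsel rather than asserted. Field names and the threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Obligation:
    source: str                   # statute, clause, or policy section cited
    actor: str                    # who must comply
    action: str                   # what must (or must not) be done
    deadline: date | None         # time bound, if any
    evidence_required: list[str]  # what proves compliance
    confidence: float             # calibrated confidence in the extraction itself

def route(obligation: Obligation, threshold: float = 0.90) -> str:
    """Abstain below threshold: a human interprets the source text instead of the agent."""
    if obligation.confidence < threshold:
        return "escalate_to_counsel"
    return "add_to_control_register"
```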
Define risk tiers and approval ladders (e.g., agent may draft, but humans sign; agent may file only for low-risk classes under policy).
Build provenance and audit trails of every clause, citation, and inference for defensibility.
Re-role lawyers/compliance staff to reviewers, exception-handlers, and governance architects, not manual drafters.
Align incentives: firms must reward defensibility and auditability, not only speed.
Weather, grid, and logistics are governed by physical or stochastic processes that admit modeling and fast surrogates (GraphCast / FourCastNet).
Decisions (dispatch, routing, hedging, scheduling) can be linked to model predictions, creating closed decision loops.
These systems have huge, measurable consequences; even marginal accuracy improvements have economic and societal leverage.
Downstream actions are safety-/mission-critical (grids, supply chains, disaster response); catastrophic error cost is high.
Models must generalize under regime shift (rare extremes, climate drift, geopolitical shocks).
Many decisions require multi-objective tradeoffs (cost, risk, emissions, fairness, SLAs).
Actionability gap: forecasts must translate into executable plans under constraints.
Readiness is high for forecasting itself (AI emulators already outperform classical baselines on multiple metrics).
Readiness is moderate for decision support (ranked options, scenario stress tests, human-in-the-loop).
Readiness is low for fully autonomous operations without oversight due to risk, regulation, and liability.
Robust uncertainty quantification and communication, especially for tail risks and low-frequency extremes.
Coupling between forecast layer and optimization layer (turning predictions into commitments with constraints), as sketched below.
Simulation-to-decision governance: fallbacks, overrides, and rollback for wrong calls.
Regulatory and market-clearing structures that currently assume human forecasters and must be adapted.
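One concrete way to couple a probabilistic forecast to an operational commitment is to size reserves from a tail quantile of the forecast ensemble rather than from its mean, so tail risk is explicitly priced into the plan. In the sketch below, the Gaussian stand-in ensemble and the 99th-percentile choice are purely illustrative.

```python
import numpy as np

def reserve_commitment(demand_samples, scheduled_generation, quantile=0.99):
    """Reserve (same units as demand) sized against a tail quantile of net demand."""
    tail_demand = np.quantile(demand_samples, quantile)
    return max(tail_demand - scheduled_generation, 0.0)

rng = np.random.default_rng(0)
samples = rng.normal(loc=950.0, scale=40.0, size=5000)   # stand-in forecast ensemble
reserve = reserve_commitment(samples, scheduled_generation=1000.0)
```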
Deploy AGI as decision copilots first: propose and justify plans; humans retain dispatch authority.
Require post-hoc attribution: log forecast state, options considered, rationale, and chosen action for auditability.
Build institutional trust pathways (shadow-mode operation; dual-control periods; staged authority transfer).
Update regulatory frameworks so algorithmic participation in energy/logistics is legally recognized and bounded.
Industrial processes consist of repeatable physical tasks with measurable quality/throughput/cost metrics.
Vision–language–action models (RT-2, PaLM-E) show transfer from web knowledge to embodied control.
Planning + feedback from sensors allows closed-loop optimization in factories, logistics, and infrastructure.
Embodied errors have physical cost: damage, downtime, and safety incidents cannot be “reverted” like code.
Real-world variation (lighting, wear, clutter, weather) breaks brittle policies trained on idealized distributions.
Multi-robot coordination, task allocation, and human co-presence raise complexity and liability.
Edge deployment constraints: limited compute, latency, connectivity, and safety-certifiable stacks.
Readiness is high for perception and local autonomy (detection, grasping, pick-place, inspection under constraints).
Readiness is moderate for task-level autonomy in structured environments (warehouses, fabs, labs) with guardrails.
Readiness is low for generalist unstructured autonomy (streets, construction, disaster zones) without human supervision.
Robust sim-to-real transfer with uncertainty-aware control and active correction, not brittle feed-forward execution.
Safety envelopes with formal guarantees and runtime monitors for collision, force, chemical/bio hazards (sketched below).
Task decomposition interfaces so high-level intent can be grounded into safe executable sequences.
Lifecycle governance: calibration, drift detection, fault diagnosis, rollback, and incident forensics.
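A runtime monitor can be as plain as a fixed safety envelope checked on every control tick, with any violation triggering a protective stop. The limits in the sketch below are placeholders; formal guarantees would require certified controllers and verified sensing underneath, which this does not provide.

```python
from dataclasses import dataclass

@dataclass
class SafetyEnvelope:
    max_speed_mps: float = 1.0
    max_contact_force_n: float = 50.0
    min_human_distance_m: float = 0.5

def check_tick(speed_mps, contact_force_n, human_distance_m, env=None):
    """Return the list of violated limits; any entry should trigger a protective stop."""
    env = env or SafetyEnvelope()
    violated = []
    if speed_mps > env.max_speed_mps:
        violated.append("speed")
    if contact_force_n > env.max_contact_force_n:
        violated.append("contact_force")
    if human_distance_m < env.min_human_distance_m:
        violated.append("human_keepout")
    return violated
```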
Introduce autonomy in bounded cells first with interlocks and physical segmentation.
Keep humans as verifiers/authorizers; define escalation logic and stop-conditions.
Retrain workforce from manual operation to supervision, exception-handling, and continuous improvement.
Integrate autonomy into EHS, insurance, and liability frameworks before expanding scope.
Healthcare is information-dense, rule-dense, and repetitive — ideal for AI analysis, triage, and recommendation.
Biological design (proteins, drugs, targets) is already being transformed by in-silico models.
Clinical domains have the largest societal benefit per error prevented, but also the highest cost per error made.
Ground truth is messy, delayed, or unavailable; outcomes are confounded and patient-specific.
Failure cost is maximal: harm, liability, ethics, regulation, and public trust constraints dwarf all other domains.
Norms encode non-technical values (consent, dignity, fairness, triage ethics) that are not reducible to accuracy alone.
Integration across fragmented systems (EHRs, devices, payers, local laws) is brittle and politicized.
Readiness is high for assistive cognition (summaries, guideline checks, differential suggestions, documentation, coding).
Readiness is moderate for decision support under human sign-off (triage ranking, risk alerts, drug–drug checks).
Readiness is low for autonomous clinical decisions or interventions without human responsibility.
Verifiable uncertainty and abstention mechanisms to force escalation when the system is unsure (sketched below).
Long-horizon causal evaluation to detect harms that only surface months or years later.
Alignment of AI outputs with ethical/legal care standards, not merely statistical accuracy.
Regulatory pathways for certifying agentic systems, not just static models.
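As a sketch of an abstention mechanism, the snippet below gates recommendations on predictive entropy and escalates to a clinician when uncertainty is high; the entropy threshold and the class-probability interface are illustrative, and the gate only helps if the probabilities are actually calibrated.

```python
import numpy as np

def recommend_or_escalate(class_probs, labels, max_entropy_bits=0.8):
    """Suggest the top label only when predictive entropy is low; otherwise escalate."""
    p = np.clip(np.asarray(class_probs, dtype=float), 1e-12, 1.0)
    entropy = float(-np.sum(p * np.log2(p)))
    if entropy > max_entropy_bits:
        return {"action": "escalate_to_clinician", "reason": f"entropy {entropy:.2f} bits"}
    top = int(np.argmax(p))
    return {"action": "suggest", "label": labels[top], "confidence": float(p[top])}
```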
Deploy in co-pilot configuration with hard human-in-the-loop for all consequential actions.
Build audit-by-design: log evidence, rationales, and uncertainty for every recommendation.
Redefine clinician roles toward oversight, interpretation, and patient-facing reasoning.
Engage regulators, malpractice insurers, and ethics boards early; without institutional legitimacy, autonomy cannot deploy.