
Incentives are the deep code of civilization. They sit beneath laws, institutions, and personal decisions, quietly guiding what behaviors are rewarded, what values flourish, and which strategies dominate. Yet this architecture is profoundly fragile. When incentives drift from the goals they were meant to serve, systems begin to optimize for proxies—measurable, manipulable stand-ins that eventually decouple from what matters. This is not a fringe bug. It’s the default behavior of complex systems under pressure. In practice, misalignment emerges not from ignorance or malice, but from bounded rationality operating inside warped reward landscapes.
This creates a deep paradox: intelligent agents—whether humans or machines—can act in highly rational ways that nonetheless produce irrational outcomes. A teacher who drills students for tests instead of cultivating understanding is responding logically to performance evaluations. A startup that prioritizes click-through rates over long-term user well-being is playing the incentives correctly. Even a scientific institution might fund careful, replicable work less readily than flashy, novel results, not out of incompetence but because funding and recognition track publication volume and citations. These behaviors are not mistakes; they’re locally aligned. That’s the danger.
Misalignment is therefore not just a matter of "wrong thinking." It is structural. It is baked into the reward functions of our institutions. And the consequences are cumulative. Every layer of proxy optimization—every moment a short-term signal replaces a long-term goal—adds entropy to the system. Over time, these distortions become invisible, assumed, even institutionalized. When this happens at scale, we encounter civilizational suboptimality: a society where coordination fails not because people are stupid, but because the incentives are misaligned across every level of decision-making.
The complexity deepens when we introduce value conflict. Human values are not only hard to measure—they are inherently pluralistic and often contradictory. We want fairness and efficiency, security and freedom, tradition and progress. The process of surfacing and negotiating these values is political, cultural, and contextual. This means that even the goal of alignment is fuzzy and dynamic. In such a landscape, even well-meaning attempts at system design risk oversimplifying or prematurely formalizing values—thereby reinforcing proxies rather than capturing true intent.
What makes this even harder is that most institutions lack epistemic feedback. Their incentives are not only misaligned—they’re opaque. There's no ground truth for them to compare against. For example, how does a government know if its education system is genuinely cultivating autonomous, critical thinkers? The signal is noisy, slow, and often overwhelmed by louder, more legible metrics (like standardized test scores or graduation rates). This opacity encourages institutions to treat the available data as the truth—even when that data reflects narrow, surface-level success.
Now enter advanced AI. Alignment, in the context of artificial general intelligence, depends on a civilization that can reliably signal what it values. But if those values are encoded in distorted, misaligned proxies, any sufficiently capable optimizer will learn to exploit them—accelerating Goodhart's Law at unprecedented scale. This is the real risk: not a rogue paperclip maximizer, but a perfectly competent optimizer trained on flawed human reward structures. The AI does not go rogue—it follows the signals we gave it, and those signals reflect our own systemic failures.
Thus, AI alignment is not merely a technical problem. It is downstream of institutional epistemology. A system trained on human data will learn what humans do, not necessarily what they mean or wish. To solve AI alignment, we must first demonstrate that we can align humans with their own values, through institutional feedback loops, updated incentive structures, and reward architectures that track actual outcomes—not performative metrics. Alignment must be practiced before it can be programmed.
The good news is that alignment is not mysterious—it is tractable when we treat it as a systems engineering problem. We can create transparent feedback loops, composite metrics that resist gaming, and participatory structures that surface true values over time. We can build systems that reward epistemic humility, that make updating and corrigibility status-enhancing rather than career-threatening. We can train AI models in environments designed for robustness, corrigibility, and cooperative strategy. But all of this requires recognizing misalignment as a design failure—not a human flaw.
Incentive alignment is civilization’s core scalability problem. Every time we delegate power to a process—be it a person, a company, or an AI—we are expressing a belief: that this agent will do what we hope, not just what we measure. But hope is not a design strategy. If we want to build systems we can trust, we must stop optimizing what’s easy and start building what’s true. That begins by facing the hard reality: most of what we reward today is not what we actually want. Fixing that is the first step toward any future worth living in.
Don’t reward what’s easiest to measure—reward what actually matters. Proxy metrics should serve goals, not replace them.
Single metrics invite gaming. Use a bundle of partially redundant signals to keep behavior aligned with complex objectives.
Shift rewards away from short-term optics and toward sustained, verifiable outcomes. Time reveals alignment better than snapshots.
Design systems where being right pays more than being persuasive. Tie influence and reward to predictive power and epistemic reliability.
Misalignment thrives in darkness. Build systems where decisions, reasoning, and consequences are visible, traceable, and debuggable.
Ensure that those acting (agents) benefit when their outcomes serve those who trust them (principals). Shared fate enables shared goals.
Celebrate those who change their minds for good reasons. Make belief revision a sign of strength, not weakness.
Train humans and AIs in environments where honesty, cooperation, and corrigibility are structurally rewarded. Context shapes behavior.
Every incentive scheme degrades. Build institutions that can observe, critique, and refine their own reward systems over time.
Allow many legitimate paths to success. Redundancy prevents collapse; pluralism protects against monoculture misalignment.
Reward coalition-building and mutual modeling. Misalignment is accelerated by adversarial dynamics; cooperation is alignment in motion.
AI systems learn from our institutions. If we train them on misaligned systems, we get superhuman misalignment. Fix ourselves first.
Proxy Deconstruction and Goal Realignment
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
Goal-Proxy Decoupling: Over time, proxies drift from the real-world goals they once correlated with.
Alignment Preservation: True optimization requires keeping the proxy tightly coupled with the intended outcome.
Most institutions don’t fail because they ignore their goals—they fail because they substitute proxies for the real thing, then forget the original intention. Proxy metrics, once optimized, become gameable and lose their original meaning. Alignment begins by refocusing optimization on the actual desired outcome.
Proxies are used because real goals are often hard to measure. We use test scores as a stand-in for learning, engagement as a stand-in for value, GDP as a stand-in for societal welfare. Initially, these proxies are helpful. But once people start optimizing them directly—tying careers, profits, or status to them—they become targets, and the causal link to the underlying goal weakens or inverts.
A school that maximizes test scores may suppress curiosity. A hospital optimizing for patient throughput may reduce actual care quality. A startup optimizing for user engagement may drive addiction, not satisfaction.
This drift is subtle at first but self-reinforcing. Proxy optimization can become locally rational but globally destructive. It leads to systems that are efficient at the wrong task—a hallmark of misalignment.
Proxy Audits: Periodically review performance metrics in each institution to ask: “Is this still meaningfully correlated with our real goal?” Remove, rebalance, or augment proxies that drift (a minimal audit sketch follows this list).
Proxy-to-Goal Mapping: Create formal models that document the assumed causal relationship between each metric and its associated real-world goal. Make these assumptions debuggable.
Institutional “Why Reviews”: Require that strategic plans or KPIs be justified explicitly in terms of the true objective. Encourage first-principles thinking before metrics are defined.
Incentive Escalation Checks: Before scaling a reward system, run simulations or sandbox tests to observe whether optimizing the proxy causes collapse in the underlying outcome.
Hybrid Evaluation: Combine quantitative proxies with qualitative assessments (e.g., peer reviews, user satisfaction interviews, direct inspections) to catch proxy drift early.
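To make the proxy-audit idea above concrete, here is a minimal sketch, assuming we have paired historical readings of a proxy metric and a slower ground-truth outcome; the rolling window and correlation threshold are illustrative choices, not recommendations.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def proxy_audit(proxy, outcome, window=8, threshold=0.4):
    """Flag periods where the proxy's rolling correlation with the
    real outcome drops below a chosen threshold (both illustrative)."""
    alerts = []
    for end in range(window, len(proxy) + 1):
        r = pearson(proxy[end - window:end], outcome[end - window:end])
        if r < threshold:
            alerts.append((end - window, end, round(r, 2)))
    return alerts

# Example: a proxy that tracks the outcome early on, then decouples.
proxy   = [60, 62, 65, 66, 70, 74, 79, 85, 90, 94, 97, 99]
outcome = [58, 61, 64, 66, 69, 71, 72, 71, 69, 66, 64, 61]
print(proxy_audit(proxy, outcome))
```

In practice the ground-truth series arrives later and noisier than the proxy, which is exactly why the check needs to be scheduled rather than assumed.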
Redundant Objectives and Multidimensional Fitness
Multi-criteria Optimization: No single metric captures complex goals—combine overlapping signals.
Error Compensation: One metric’s failure can be caught by another.
Pareto Efficiency: Optimize along Pareto fronts, where trade-offs can be visualized and judged rather than collapsed into a single scalar.
When we rely on a single metric to guide decisions, we invite Goodharting and brittleness. A more robust approach is to optimize across multiple metrics that reflect different aspects of the goal, reducing the chance of gaming any one metric in isolation.
Complex goals (like education, health, justice) are inherently multi-faceted. Trying to collapse them into a single score leads to distortion. For example:
A good teacher isn’t just someone whose students score well, but someone who inspires, mentors, and adapts.
A good doctor balances treatment effectiveness, patient experience, long-term health, and cost-efficiency.
A good AI assistant may need to be helpful, honest, non-manipulative, and explainable.
By optimizing across several dimensions, we reduce the risk of any one proxy being gamed. When metrics are partially redundant, they reinforce one another and create a richer reward signal.
Pareto front optimization also introduces exploratory trade-off thinking: rather than pushing one metric to the max, decision-makers consider various “efficient frontiers” of trade-offs—e.g., between speed and accuracy, or engagement and depth.
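As a small illustration of that framing (a toy filter, not any particular production algorithm), the sketch below reduces a set of candidate options to its Pareto front across two invented metrics, speed and accuracy, so the frontier can be inspected instead of a single collapsed score.

```python
def pareto_front(candidates):
    """Return candidates not dominated on any metric.
    A candidate dominates another if it is >= on every metric
    and strictly > on at least one (higher is better here)."""
    def dominates(a, b):
        return all(a[k] >= b[k] for k in b) and any(a[k] > b[k] for k in b)

    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

# Illustrative options scored on two metrics instead of one scalar.
options = [
    {"speed": 0.9, "accuracy": 0.60},
    {"speed": 0.5, "accuracy": 0.92},
    {"speed": 0.7, "accuracy": 0.85},
    {"speed": 0.4, "accuracy": 0.40},   # dominated: worse on both
]
print(pareto_front(options))            # the first three survive
```

The point is not the algorithm but the posture: present the frontier and let humans judge the trade-off.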
Composite Scorecards: Replace scalar KPIs with multidimensional dashboards that track a small set (3–7) of relevant metrics, each representing a different part of the goal.
Pareto Optimization Algorithms: For AI systems, use multi-objective RL or evolutionary strategies that learn a trade-off frontier instead of collapsing everything into a reward scalar.
Weighted Voting Panels: Use multiple evaluators (e.g., students, peers, external auditors) to judge performance from different angles. Incentivize robust performance across the board.
Proxy-Interaction Testing: Run simulations to detect cases where maximizing one metric undermines others. Penalize these failure modes explicitly.
Transparency of Trade-offs: Make trade-offs visible and explainable. Don’t pretend all metrics are commensurable—show the human behind the judgment.
Time-Shifted Incentivization
Temporal Discounting Correction: Humans and institutions naturally undervalue delayed outcomes.
Delayed Gratification Signals: Systems that reward long-term benefits stabilize alignment better than those that reward fast performance.
Outcome-Linked Accountability: Create feedback loops that reward real-world success over time, not short-term appearances.
Many systems misalign because incentives are tied to immediate outputs (e.g., quarterly earnings, daily KPIs), leading to behaviors that maximize short-term success but undermine long-term health. Long-term reward signals foster sustainability, stewardship, and genuine alignment.
Humans are myopic by nature—we discount future rewards. Institutions exacerbate this: politicians think in 4-year terms, executives in quarterly cycles. This encourages behaviors like:
Deferring maintenance
Over-extracting resources
Hiding failure to preserve optics
“Shipping the bug” and fixing it later
None of these are optimal in the long run—but they look good now, and that’s what gets rewarded.
To break this cycle, systems must make long-term success visible and valuable. That means delaying feedback until consequences play out—and building institutions that can track, remember, and reward outcomes after the action is taken.
Retroactive Public Goods Funding: Reward projects based on verified long-term value rather than upfront promises (e.g., the retroactive funding experiments in the Ethereum ecosystem).
Multi-Year Scorecards: Evaluate public officials, policies, or teachers not just in the year they operate, but 3–5 years later.
Reputation Decay/Boost Systems: Let public reputation systems incorporate post-hoc data—e.g., giving higher trust to predictors who prove correct over time.
Delayed Payout Models: Bonus schemes that mature only if outcomes persist (e.g., “longevity-based” bonuses in public service or executive compensation); a minimal sketch follows this list.
Forecasting-Based Governance: Let near-term actions be judged partially by well-calibrated long-term predictions; penalize forecast failures retroactively.
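As a minimal sketch of the delayed-payout idea above, assuming a simple persistence rule and invented numbers, a bonus can be held in escrow and released only if the measured outcome stays above its baseline for several periods after the action.

```python
def delayed_bonus(base_bonus, outcomes, baseline, vesting_periods=3):
    """Release the bonus only if the outcome stays above baseline
    for every one of the vesting periods that follow the action.
    Partial persistence pays nothing here; a real scheme might pro-rate."""
    observed = outcomes[:vesting_periods]
    if len(observed) < vesting_periods:
        return 0.0                      # not enough history yet: still in escrow
    return base_bonus if all(o > baseline for o in observed) else 0.0

# Outcome looks good in year 1, then regresses below the baseline of 100.
print(delayed_bonus(10_000, outcomes=[112, 104, 96], baseline=100))   # 0.0
print(delayed_bonus(10_000, outcomes=[112, 108, 105], baseline=100))  # 10000
```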
Truth-Aligned Incentivization
Prediction-Based Rewarding: Reward people for being correct, not persuasive.
Counterfactual Tracking: Value accrues to those whose models of the world match reality—not to those with charisma or clout.
Memetic Filtering: Truth should rise through the system not because it's popular, but because it's verifiably reliable.
One of the strongest misalignment patterns in society is that being confident pays more than being correct. To reverse this, systems must reward truth-tracking: people should be incentivized to seek, state, and act on what is true—even when it’s unpopular or uncertain.
Currently, we live in an epistemic marketplace that rewards attention, not accuracy. A viral but false claim travels faster and more profitably than a boring truth. This distortion incentivizes all kinds of epistemic misalignment:
Media publishing for outrage, not accuracy.
Careful experts losing credibility for hedging, while confident charlatans dominate headlines.
Institutions that under-report uncertainty or overstate claims to gain funding.
To fix this, we must make accuracy visibly profitable—especially in the long run. Reputation, influence, and even compensation should follow the signal of reality-tracking, not just performance theater.
Prediction Markets: Use real-money markets or community-weighted mechanisms to reward forecasters who make accurate predictions (e.g., Manifold Markets, Metaculus).
Truth Audits: Retrospective analysis of public claims, with scoreboards that track who gets it right over time (a simple scoring sketch follows this list).
Epistemic Reputation Systems: Build platforms where trust flows toward people with well-calibrated track records, not just volume or virality.
Smart Contracts for Truth: Use crypto-based systems to automate rewards for correct predictions (e.g., resolve payouts only when a real-world event settles).
Epistemic Leaderboards: Celebrate and promote those who demonstrate persistent, testable truth-seeking—especially across domains, not just within niches.
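One minimal way to keep such a scoreboard, assuming binary claims and stated probabilities (both simplifications), is the Brier score: the mean squared gap between a forecaster's probability and what actually happened, where lower is better.

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes.
    Each entry is (probability_assigned_to_yes, outcome) with outcome 1 or 0.
    0.0 is perfect; 0.25 is what always saying 50% earns."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# A cautious, well-calibrated forecaster vs. a confident, often-wrong one.
calibrated = [(0.7, 1), (0.3, 0), (0.8, 1), (0.6, 0)]
confident  = [(0.95, 1), (0.9, 0), (0.99, 0), (0.85, 1)]
print(round(brier_score(calibrated), 3))  # lower (better) score
print(round(brier_score(confident), 3))   # higher score despite sounding surer
```

Proper scoring rules of this kind make confident wrongness expensive, which is the whole point of a truth audit.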
Visibility-Driven Accountability
Legibility Enables Correction: You can’t improve what you can’t see.
Transparency Reduces Misalignment Drift: Opaque systems drift from values because no one can verify behavior.
Observable Behavior Shapes Strategy: Agents align more easily to real goals when they know those goals are monitored and interpreted accurately.
Misalignment flourishes in black boxes. When agents (human or AI) act in environments where performance isn’t observable or traceable to outcomes, they can optimize for local tricks and hidden loopholes. Transparent feedback loops increase the cost of deception and the reward for visible alignment.
Systems that lack visibility—like opaque bureaucracies, poorly-instrumented models, or institutions with no follow-up—produce outputs that can’t be linked to responsibility. This encourages superficial optimization and undermines alignment.
By contrast, feedback loops with visibility and interpretability create accountability. If teachers know that long-term student outcomes are tracked, they invest in real learning. If AI systems are evaluated on why they made a choice—not just what outcome occurred—they must expose their reasoning, which can then be debugged or rewarded.
Transparency also supports debuggability. It gives oversight institutions (auditors, regulators, citizens) the tools to spot when proxies are being gamed or when outputs contradict goals.
Legible Audit Trails: Every decision by a system or institution should be traceable to its inputs and stated justification (e.g., model interpretability layers, human decision logs); a minimal record sketch follows this list.
Transparency Dashboards: Expose real-time performance metrics, process traces, and outcome links to internal and external observers.
Open Source Reasoning Logs: For AI systems, publish not just outputs but chains of reasoning or token-level rationales.
Whistleblower Protection + Incentives: Encourage internal transparency by aligning incentives for truth-tellers who detect misalignment.
Public Feedback Channels: Embed response mechanisms so misalignment can be spotted by users, citizens, or employees—then acted on by leadership.
“Glass Box” Model Evaluation: Shift from “black box AI performs task” to “glass box AI shows how it thinks, and humans can validate intermediate reasoning.”
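As a minimal sketch of the audit-trail idea above, each decision could be captured as a structured record that links inputs, stated reasoning, and the outcome observed later; the field names here are illustrative, not a standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    """One traceable decision: what was seen, what was chosen, and why."""
    actor: str
    inputs: dict                 # the evidence available at decision time
    decision: str
    justification: str           # stated reasoning, reviewable later
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    observed_outcome: str | None = None   # filled in once consequences land

log: list[DecisionRecord] = []
log.append(DecisionRecord(
    actor="grants-committee",
    inputs={"replication_status": "unreplicated", "citations": 240},
    decision="fund",
    justification="High citation count; replication pending.",
))
log[0].observed_outcome = "study failed replication two years later"
```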
Principal-Agent Convergence
Skin in the Outcome: People act differently when their welfare is tied to long-term impact, not just process.
Incentive Congruence: Alignment improves when agents and principals share overlapping utility functions.
Misalignment Flourishes at a Distance: Agents who are insulated from outcomes (e.g., outsourced contractors, detached bureaucrats) tend to optimize for self-preservation, not mission success.
A core alignment challenge is the principal-agent problem: the person taking action (agent) does not necessarily benefit or suffer from the consequences of their actions the way the goal-setter (principal) does. Fixing this means aligning their interests structurally.
Whether in corporations, governments, or machine systems, the misalignment between the one making decisions and the one bearing consequences is a root cause of incentive drift.
A policymaker may pass laws that are popular now but harmful later.
An employee might make decisions that boost short-term performance metrics while harming long-term company health.
An AI assistant might give a plausible answer instead of a truthful one—because it receives feedback on helpfulness, not accuracy.
To solve this, we must tie rewards and penalties to outcome quality. The agent should benefit when the principal’s values are satisfied and be penalized when they're violated.
This isn’t always possible directly, but indirect proxies (equity, delayed payouts, transparency) can bring agent goals closer to principal desires.
Equity and Profit-Sharing: Give agents real stakes in outcomes—e.g., employees owning equity, researchers rewarded from downstream IP success.
Performance-Based Contracting: Design compensation around verified goal achievement, not effort or process metrics.
Agent-Auditable Objectives: Make it easy for agents to understand what matters to the principal. If the AI or worker doesn’t know the real goal, they’ll guess—and likely guess wrong.
Delayed Feedback Contracts: Introduce lagged bonus systems that tie final rewards to the long-term impact or quality of outcomes.
Fine-Grained Attribution Systems: Use digital traceability to link outcomes (positive or negative) to specific actors—enabling reward redistribution or accountability where it matters.
Epistemic Integrity Incentivization
Updating is Evidence of Rationality: Changing your mind in response to evidence is a signal of alignment with truth.
Confidence Without Correction is Fragile: Systems that punish uncertainty and reward certainty—even when wrong—produce brittle models.
Reputation Should Track Accuracy, Not Stubbornness: High status should follow those who revise beliefs responsibly.
Truth-seeking systems must reward the willingness to update. But in many environments, changing your mind is seen as weakness. To foster alignment, we must reverse this and make honest belief revision a mark of reliability.
In political, academic, and corporate life, people often cling to outdated positions because their reputation, status, or salary is attached to them. This leads to:
Intellectual stagnation
Status games over truth
Models that persist despite disconfirming evidence
Epistemic environments (including AI models) need to incentivize correction, not just confidence. If someone makes a claim and later finds out they were wrong—and corrects it publicly—they should be rewarded, not punished.
This models intellectual humility, creates feedback loops around learning, and fosters environments where people “compete to be right,” not just to be dominant.
Belief Revision Trackers: Let users or agents log public forecasts and update probabilities over time. Reward improvements in calibration (a minimal scoring sketch follows this list).
“Mind Change” Bonuses: Platforms or institutions can gamify belief shifts—celebrating public updates grounded in evidence.
Reverse Hall-of-Fame: Highlight impactful corrections that led to better decisions—e.g., journal retractions that prevented harm.
Uncertainty Tokens: In discussion platforms, allow people to indicate confidence intervals or probabilities—normalize not knowing.
Epistemic Career Capital: Make intellectual humility a pathway to influence—give weight to those with histories of rational updating.
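As a minimal sketch of the belief-revision-tracker idea above (the scoring rule is an invented example), a public update can be credited by how much it moved the forecast toward the eventual outcome, so changing one's mind for good reasons literally scores points.

```python
def revision_credit(initial_p, revised_p, outcome):
    """Credit = reduction in squared error achieved by the revision.
    Positive when the update moved toward what actually happened,
    negative when it moved away."""
    return (initial_p - outcome) ** 2 - (revised_p - outcome) ** 2

# Publicly updating from 20% to 70% on a claim that turned out true.
print(round(revision_credit(0.2, 0.7, outcome=1), 3))   # +0.55: rewarded
# Doubling down from 40% to 10% on the same true claim.
print(round(revision_credit(0.4, 0.1, outcome=1), 3))   # -0.45: penalized
```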
Alignment-Conducive Simulation Environments
Environment Shapes Strategy: The rules of the game determine how players behave.
Alignment is Easier When Aligned Behavior Wins: If lying, manipulating, or proxy-hacking is the winning strategy, agents will do it.
Training Norms Become Habits: AI (and humans) internalize patterns during formative training—so aligned environments produce aligned tendencies.
We can’t expect aligned behavior to emerge from misaligned settings. Whether training humans or AI, we must build environments where alignment is structurally advantageous, and misalignment fails. If we get the simulation right, we shape aligned behavior by design.
A child raised in a violent, chaotic household internalizes different strategies than one raised in a nurturing, cooperative one. Similarly, AI trained in a data environment full of toxic incentives, superficial proxies, or adversarial actors will generalize those patterns.
Instead, we must create environments—virtual or institutional—where honest cooperation, robustness, and epistemic hygiene are rewarded over all else. This is like raising a child in a good village or bootstrapping AGI in a simulation that mirrors cooperative human values.
For LLMs and agentic models, this might mean careful selection of training data, deliberate adversarial probing, synthetic environments that test for deception, and game-theoretic architectures that reward collaborative problem-solving.
Red-Teaming and Adversarial Alignment: Expose agents to environments where dishonesty, deception, or shortcutting are penalized—force them to develop generalizable alignment strategies.
Synthetic Alignment Simulations: Build closed-loop environments (e.g., alignment games or ethical decision-making tasks) where aligned behavior is consistently the best strategy (a toy example follows this list).
Curriculum Design for Agents: Sequence training environments to reward foundational reasoning, uncertainty management, and intent transparency before scaling to complex tasks.
Human-AI Alignment Role-Play: Involve humans in simulation loops that reinforce collaborative goal inference and value extrapolation.
Incentivize “Transparency Habits” in Training: Give rewards for explanation, corrigibility, and ask-for-help behavior during early model shaping.
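As a toy illustration of the synthetic-simulation idea above, with payoffs, audit probability, and penalty all invented for the example, an environment can be tuned so that gaming the metric has lower expected value than honest reporting, making alignment the winning strategy rather than a moral appeal.

```python
import random

def episode(strategy, audit_prob=0.4, penalty=10.0):
    """One round: 'honest' earns a modest reliable reward; 'game_metric'
    earns more on paper but is heavily penalized if an audit catches it."""
    if strategy == "honest":
        return 3.0
    reward = 5.0
    if random.random() < audit_prob:
        reward -= penalty
    return reward

def expected_return(strategy, episodes=100_000):
    random.seed(0)  # reproducible toy estimate
    return sum(episode(strategy) for _ in range(episodes)) / episodes

print(expected_return("honest"))       # ~3.0
print(expected_return("game_metric"))  # ~1.0: gaming loses once audits bite
```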
Incentive Iterability and Meta-Governance
No Perfect Incentive Exists Forever: Every metric or incentive structure will eventually be gamed or obsoleted.
Governance Requires Governance: You need systems in place to adjust the system itself—meta-systems that observe and update the incentive landscape.
Institutional Plasticity: Systems should evolve as feedback and failure patterns emerge.
Incentive schemes degrade over time. If there's no embedded mechanism to regularly reevaluate and tune them, they become liabilities. Therefore, effective systems must not only define good incentives—they must remain open to changing them in response to observed misalignments.
Even the best-designed systems will encounter regime drift. New technologies emerge, people discover edge cases, and cultures evolve. A health system that worked in 2005 may be catastrophically misaligned in 2025.
Without embedded adaptivity, you risk lock-in: outdated incentive structures that actively resist revision because people benefit from their flaws.
Meta-governance mechanisms—like institutional self-audits, rotating evaluation committees, or automated anomaly detection—allow for responsive refinement. This means reward systems aren’t static—they’re subject to continual improvement.
Adaptive systems also build legitimacy: participants are more willing to engage in a system that can learn and respond.
Incentive Lifecycle Audits: Require every major incentive (e.g., KPIs, funding metrics) to be reviewed periodically for misalignment or unintended side effects.
Meta-Institutions: Create units within organizations whose sole job is to monitor, simulate, and revise incentive schemes (e.g., a “Department of Institutional Fitness”).
Feedback Loops on Feedback Loops: Use second-order metrics that track the performance of incentive systems themselves—e.g., “Are our metrics helping us achieve our mission?”
Test-and-Roll Incentives: Treat incentive design as an experiment: A/B test, roll out slowly, gather feedback, and adjust (a minimal decision-rule sketch follows this list).
Participatory Revision Frameworks: Allow those being incentivized to suggest incentive revisions, challenge perverse structures, and participate in collaborative redesign.
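As a minimal sketch of the test-and-roll idea above (the decision rule and numbers are invented), a new incentive scheme can be piloted on a small cohort and expanded only if it beats the control group on both the fast proxy and a slower ground-truth check.

```python
def should_expand(control, treatment, min_lift=0.02):
    """Expand the new incentive scheme only if the treatment cohort beats
    control on BOTH the fast proxy and the slower ground-truth outcome."""
    proxy_lift = treatment["proxy"] - control["proxy"]
    truth_lift = treatment["outcome"] - control["outcome"]
    return proxy_lift >= min_lift and truth_lift >= min_lift

control    = {"proxy": 0.61, "outcome": 0.55}
goodharted = {"proxy": 0.75, "outcome": 0.52}   # proxy up, real outcome down
genuine    = {"proxy": 0.68, "outcome": 0.60}

print(should_expand(control, goodharted))  # False: classic proxy gaming
print(should_expand(control, genuine))     # True: safe to roll out further
```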
Diversity as Alignment Insurance
No Metric Captures Everything: Diversity in evaluation criteria, strategies, and structures reduces systemic failure risk.
Redundancy Prevents Collapse: If one alignment pathway breaks, others can still function.
Pluralism Guards Against Monopoly Misalignment: Encouraging many ways to succeed reduces the risk of one misaligned dominant approach.
A monoculture of incentives—where only one strategy or metric defines success—is fragile and dangerous. Instead, systems should offer multiple overlapping paths to alignment. This makes systems robust to drift, exploitation, and local failures.
Just as biodiversity protects ecosystems from collapse, strategic diversity protects institutions from overfitting to flawed incentives. If your scientific field only rewards citations, it may neglect replication. But if it also rewards community service, teaching, interdisciplinary synthesis, and long-term impact, then success becomes multifaceted and less gameable.
Pluralism also allows for experimentation—different subgroups can try different strategies. Over time, the system can learn which combinations of behaviors lead to alignment and which don’t.
In AI systems, diverse reward channels (e.g., human feedback, simulated evaluation, self-consistency checks) reduce reliance on any one fragile or corruptible signal.
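One simple way to make that concrete, using an illustrative median rule and invented channel names rather than any particular lab's training setup, is to aggregate several reward channels with a robust statistic, so that corrupting a single channel barely moves the combined signal.

```python
from statistics import median

def robust_reward(channels):
    """Combine several reward signals with the median so that one
    corrupted or gamed channel cannot drag the aggregate very far."""
    return median(channels.values())

honest = {"human_feedback": 0.8, "simulated_eval": 0.75, "self_consistency": 0.78}
gamed  = {"human_feedback": 0.8, "simulated_eval": 0.75, "self_consistency": 5.0}

print(robust_reward(honest))  # 0.78
print(robust_reward(gamed))   # still 0.8: the spiked channel is ignored
```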
Multi-Track Success Criteria: Allow individuals or systems to earn trust through diverse modes of contribution: reliability, innovation, consensus-building, ethics, etc.
Decentralized Evaluation Panels: Evaluate contributions via committees with varied values and methods.
Rotating Metrics: Change evaluation metrics periodically to prevent stagnation and gaming.
Incentive Portfolio Design: Like financial portfolios, diversify the incentive landscape—some fast, some slow, some peer-reviewed, some public.
Intentional Overlap: Design systems where the same outcome is supported by multiple, independently functioning incentives.
Cooperation Over Extraction
Incentives Shape Group Dynamics: Systems that reward defection, sabotage, or zero-sum thinking breed misalignment.
Alignment Requires Coalitional Thinking: If your system doesn’t reward coordination, it punishes it by default.
Coordination Reduces Goal Fracturing: By incentivizing joint success, you align local optimization with global benefit.
Alignment is not only about individual intelligence but group strategy. If people, agents, or institutions are rewarded more for competing than for coordinating, misalignment is guaranteed. We must create environments where cooperation outcompetes defection.
Our current institutions often reward “being first,” “beating the opponent,” or “owning the narrative.” These are adversarial incentives. Even when agents recognize common goals, the incentive structure pits them against each other.
By contrast, coordination-friendly systems reward coalition formation, mutual goal clarification, and shared infrastructure. Alignment emerges naturally when people realize that helping each other helps themselves.
For AI, this means training agents that collaborate with humans and each other—learning mutual modeling, intention sharing, and fair negotiation.
Team-Based Incentives: Use group-based performance evaluations and shared success metrics.
Anti-Defection Bonuses: Reward transparency, contribution to commons, or aid to competitors (e.g., via open-sourcing, interoperability).
Coordination Protocol Layers: Within AI ecosystems, build APIs and decision layers that allow agents to share goals, sync plans, and adjust behaviors together.
Coalition-Building Metrics: Reward individuals or groups that help align others—e.g., providing reusable tools, shared language, or arbitration frameworks.
Shared Fate Contracts: Tie rewards to joint outcomes—if one fails, all are penalized; if all succeed, bonuses scale.
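As a minimal sketch of the shared-fate idea above (the scaling rule and numbers are invented), a joint bonus can be made to depend on the weakest member's outcome, so every party's payoff rises and falls together.

```python
def shared_fate_bonus(pool, outcomes, threshold=1.0):
    """Split a bonus pool equally, scaled by the weakest outcome.
    If any member falls below the threshold, everyone's bonus shrinks;
    if all clear it, everyone's bonus grows together."""
    weakest = min(outcomes.values())
    scale = min(weakest / threshold, 2.0)       # cap the upside for the example
    share = pool / len(outcomes)
    return {name: round(share * scale, 2) for name in outcomes}

# One team member misses the bar, so the whole coalition feels it.
print(shared_fate_bonus(30_000, {"a": 1.4, "b": 1.2, "c": 0.6}))
# Everyone clears it, so everyone's payout scales up together.
print(shared_fate_bonus(30_000, {"a": 1.4, "b": 1.2, "c": 1.1}))
```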
Civilizational Bootstrapping
Garbage In, Garbage Out: If AI learns from our current systems, and those systems are misaligned, we bake our dysfunctions into the AI.
Institutional Clarity is a Precursor to Safe Optimization: You can’t align AI to humanity if humanity doesn’t know what it values.
AI Alignment Is Downstream of Social Alignment: Coherent reward signals can only come from coherent systems.
You cannot train an AGI to align with human goals if those goals are unmeasured, contradictory, or distorted by existing incentives. The first stage of alignment is making our own institutions legible, value-consistent, and reward-sane. Otherwise, AGI just becomes a high-speed optimizer for a broken world.
A foundation model learns from human data: text, feedback, behavior. If that data encodes proxy-chasing, adversarial tactics, tribal heuristics, and engagement farming, the model will internalize those patterns.
Likewise, if reward functions are shaped by misaligned institutions—corporations maximizing short-term profit or political systems optimizing popularity—the resulting AI will mimic and amplify those distortions.
To avoid this, the inputs must be cleaned and the training environments upgraded. AI alignment depends on civilizational legibility: institutions that produce clean signals of what humans truly value, not signals distorted by broken feedback loops.
Institutional Alignment Audits: Evaluate major institutions (education, government, science) for proxy-metric distortion and alignment fidelity.
High-Trust AI Pilots: Test early-stage models in environments where alignment feedback is unusually clean (e.g., scientific inquiry, cooperative games, structured deliberation).
Civic Epistemology Training Sets: Feed AI models with data from well-reasoned, multi-perspective, non-tribal human debates and processes.
Incentive Hygiene for AI Labs: Align corporate and research lab objectives with public interest via transparency, third-party audits, and open governance.
Simulated Alignment Societies: Build testbeds of aligned human-AI interaction in sandboxed environments that explore what full-stack value-aligned systems could look like.