
August 4, 2025
François Chollet is a widely recognized figure in the field of artificial intelligence, not only for creating Keras, one of the most widely used deep learning libraries in the world, but also for challenging the mainstream trajectory of AI research. While much of the field celebrates scaling laws and massive models as the path forward, Chollet has consistently pushed back, arguing that this approach represents a fundamental misunderstanding of what intelligence really is. His work goes beyond engineering; it is a call to rethink the foundations of the discipline and to anchor progress in a scientifically rigorous and philosophically coherent definition of intelligence. For Chollet, the problem is not that current models are powerful—they are—but that they are powerful in narrow, brittle ways that fail to address the real challenge of generalization and abstraction.
From his perspective, the current obsession with large language models and multimodal architectures reflects a dangerous illusion: that bigger datasets and more parameters will inevitably lead us to human-level general intelligence. These systems achieve state-of-the-art performance because they have absorbed nearly all available human-generated text and optimized over billions of gradient steps, not because they can autonomously reason about novel situations. Their competence is statistical, not conceptual; they interpolate within patterns rather than extrapolate beyond them. For Chollet, this is the crux of the issue. General intelligence cannot emerge from memorization, no matter how vast the dataset. True intelligence requires the ability to operate in open-ended domains, to form abstractions that compress experience into rules, and to apply these rules flexibly in entirely new contexts. Scaling brute force is a dead end because it sidesteps these requirements.
To provide a solid theoretical grounding for his critique, Chollet proposed one of the most precise formal definitions of intelligence to date: “Intelligence is a measure of skill acquisition efficiency over a scope of tasks, relative to priors, experience, and generalization difficulty.” This definition reframes the conversation by emphasizing efficiency, adaptability, and scope rather than raw performance. Intelligence, under this lens, is not the sum of skills a system possesses but the process that generates those skills efficiently. Humans are not born knowing language or algebra; they are born with the ability to learn these skills rapidly under constraints. By contrast, today’s AI systems require millions of labeled examples and enormous compute budgets to approximate abilities that humans learn from a handful of demonstrations. Chollet’s definition exposes this gap and shows why current evaluation metrics are inadequate for measuring real progress toward AGI.
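Read loosely, the definition can be sketched as a ratio (a simplification for intuition only; Chollet's actual formalism in "On the Measure of Intelligence" is stated in terms of algorithmic information theory):

```latex
\text{Intelligence} \;\propto\;
\frac{\text{skill attained} \times \text{generalization difficulty}}
     {\text{priors built in} + \text{experience consumed}}
```

Holding skill constant, a system that needs fewer built-in priors and less experience to reach it, on tasks that are harder to generalize to, counts as more intelligent.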
To operationalize this vision, Chollet introduced the Abstraction and Reasoning Corpus (ARC), a benchmark explicitly designed to test the ability to generalize to unseen tasks using minimal examples. Unlike conventional benchmarks that can be conquered through memorization or pretraining on massive datasets, ARC presents problems drawn from a combinatorial design space so vast that no dataset can cover it. Solving ARC requires discovering abstract structural rules—such as symmetry, color grouping, or object persistence—from just three to five demonstrations and applying them to new cases. These are the very cognitive moves humans make instinctively. Yet, despite years of progress in deep learning, ARC remains a steep challenge for AI systems: humans routinely score above 95%, while cutting-edge models languish below 40%. This persistent gap is not an accident; it reveals what Chollet considers the true bottleneck for AGI—our failure to build systems capable of abstraction, compositionality, and causal reasoning.
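For concreteness, here is a minimal sketch of what an ARC task looks like and what solving one amounts to: a few input/output grid pairs, a hidden rule to infer, and a test grid to transform. The structure mirrors the public ARC task format (JSON with "train" and "test" pairs of integer grids); the specific grids and the mirror rule are illustrative, not drawn from the corpus itself.

```python
# Minimal sketch of an ARC-style task and a toy candidate rule.
# Grids are 2D lists of color codes 0-9, as in the public ARC format;
# the task and the rule below are illustrative examples only.
from typing import List

Grid = List[List[int]]

example_task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]], "output": [[0, 5, 5]]},
    ],
    "test": [{"input": [[7, 0, 4]]}],
}

def mirror_horizontally(grid: Grid) -> Grid:
    """Candidate rule: reverse each row."""
    return [list(reversed(row)) for row in grid]

def rule_explains_demos(rule, task) -> bool:
    """Check whether a candidate rule reproduces every demonstration pair."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

if rule_explains_demos(mirror_horizontally, example_task):
    print(mirror_horizontally(example_task["test"][0]["input"]))  # [[4, 0, 7]]
```

The difficulty is that each task hides a different rule, so the candidate must be discovered from the demonstrations themselves rather than retrieved from pretraining.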
Based on these insights, Chollet has laid out a roadmap of milestones that must be achieved for human-level general intelligence (HGI) to become attainable. First, research priorities must shift away from static benchmarks and brute-force scaling toward dynamic evaluations of generalization. Systems should be measured by their ability to learn quickly, reason flexibly, and adapt autonomously, not by their ability to memorize ever-larger datasets. Second, AI architectures must incorporate mechanisms for autonomous abstraction formation—the ability to synthesize new concepts from raw observations without explicit programming. Third, compositional reasoning must be a core design principle: intelligent systems should build complex solutions by recombining simpler elements, mimicking the combinatorial creativity of human thought. Without these capabilities, models will remain trapped in statistical mimicry, unable to transcend the confines of their training distributions.
Equally important, Chollet insists that efficiency—not scale—defines intelligence. Current models consume terabytes of text and petaflops of compute to achieve competence in language tasks, whereas humans achieve comparable mastery of natural language through a few years of sparse experience. The path to HGI requires architectures that are data-frugal, compute-efficient, and energy-conscious, reflecting the astonishing economy of the human brain. This efficiency imperative extends beyond resources to knowledge representation: systems must encode experience in modular, reusable abstractions rather than sprawling, entangled weight matrices. Without compact and transferable internal structures, adaptation will remain prohibitively expensive, making lifelong learning impossible.
Another pillar of Chollet’s vision is structural cognition: the integration of neural and symbolic paradigms. Pure pattern-matching systems, no matter how large, lack the algorithmic scaffolding required for systematic reasoning and causal inference. By combining the perceptual strengths of neural networks with the structured logic of symbolic systems, we can create architectures capable of both recognizing patterns and manipulating rules. This hybrid approach, coupled with meta-learning, would enable systems to reflect on their own strategies, improving their learning processes over time. For Chollet, meta-learning is the real lever of intelligence, because it transforms experience into acceleration: each solved problem makes the next one easier, closing the loop toward autonomous self-improvement.
Finally, Chollet argues that intelligence is purposive. It is not enough for systems to respond passively to prompts; they must demonstrate agency—the ability to generate goals, prioritize them under constraints, and navigate trade-offs under uncertainty. Creativity emerges from this agency: the capacity to produce novel, useful, and contextually appropriate solutions beyond memorized patterns. Risk-aware decision-making, dynamic goal management, and autonomous planning are not peripheral features but core requirements for any system aspiring to human-level generality. In this sense, the milestones Chollet envisions are not incremental extensions of today’s deep learning—they demand a paradigm shift. From static pattern recognition to dynamic, adaptive, and self-directed intelligence; from monolithic architectures to modular hybrids; from brute-force scaling to elegant, resource-efficient design.
Chollet’s roadmap is both sobering and inspiring. It rejects the seductive simplicity of “just make it bigger” and calls for deeper questions: How do we formalize abstraction? How do we measure generalization fairly? How do we architect systems that learn as efficiently as humans? These questions define not only a technical challenge but a philosophical stance: that intelligence is a process, not a dataset; a system’s capacity to continually invent solutions, not its ability to replay them. For Chollet, the journey to AGI is not about adding layers but about building systems that learn how to learn, reason about their reasoning, and evolve their own capabilities with minimal supervision. Until we meet these milestones, what we call “intelligence” will remain an illusion painted by scale.
Core Idea: Stop chasing benchmark scores and scaling models; start prioritizing generalization and skill acquisition efficiency.
Focus on learning from minimal data, not brute-force memorization.
Evaluate intelligence relative to priors (like humans’ core knowledge).
Use open-ended benchmarks (e.g., ARC) resistant to shortcutting.
Progress = systems that adapt to novel tasks quickly.
Core Idea: Intelligence = ability to create and manipulate abstractions autonomously.
Build systems that derive abstract rules without hand-coding.
Achieve compositional reasoning: combine simple concepts into complex ideas.
Develop efficient internal representations for transferability.
Enable hierarchical reasoning for multi-step problem-solving.
Core Idea: Intelligence = doing more with less.
Optimize for data frugality (few-shot learning as default).
Prioritize compute and energy efficiency (stop brute-force scaling).
Design compact memory structures for reusable knowledge.
Measure resource cost per unit of performance, not just raw accuracy.
Core Idea: Pure deep learning won’t reach HGI; structured cognition is essential.
Integrate symbolic reasoning with neural perception for abstraction and logic.
Implement meta-learning at the symbolic level for adaptive strategies.
Build modular architectures enabling reuse across domains.
Include causal reasoning for true understanding, not correlation mimicry.
Core Idea: HGI systems must learn how to learn—and do it autonomously.
Implement meta-learning: improve skill acquisition over time.
Enable self-reflection: systems reason about their own performance.
Achieve continuous adaptation without catastrophic forgetting.
Develop self-repair mechanisms for autonomous error correction.
Core Idea: Intelligence is purposive, not reactive.
Allow systems to generate and reprioritize goals autonomously.
Enable dynamic planning under changing conditions.
Foster creativity: producing novel, useful solutions beyond memorized patterns.
Integrate risk-aware decision-making to handle uncertainty safely.
Chollet repeatedly stresses that the AI community’s current trajectory—dominated by scaling up deep learning and pursuing benchmark scores—is insufficient and misaligned with achieving AGI. Current systems excel at narrow tasks and pattern recognition but fail catastrophically at generalization and abstraction. This failure is central to his ARC work and his critique of benchmark-driven progress.
He argues that true intelligence is not about task-specific mastery but about the efficiency of acquiring new skills across a wide variety of novel tasks under resource constraints. To achieve this, he calls for a paradigm shift in research priorities away from “brute-force scaling” toward generalization-centric, resource-efficient, and autonomy-driven AI research.
Redirect AI research from achieving high performance on narrow, well-defined benchmarks to building systems that generalize to new, unseen tasks, requiring minimal retraining and leveraging abstract reasoning.
Current benchmarks like ImageNet encourage overfitting to narrow domains and memorization strategies.
Intelligence ≠ performance on fixed tasks; true intelligence = ability to adapt to novel tasks efficiently.
Chollet’s formal definition:
“Intelligence is a measure of skill acquisition efficiency over a scope of tasks, relative to priors, experience, and generalization difficulty.”
Human intelligence shines because of generalization across countless unforeseen situations, not memorized solutions.
Replace task-specific benchmarks with generalization benchmarks:
ARC (Abstraction and Reasoning Corpus): Measures ability to infer abstract rules from minimal examples.
Out-of-distribution generalization tasks.
Measure learning curves under limited-data regimes (few-shot, zero-shot learning).
Emphasize meta-learning benchmarks that test adaptability across tasks.
Weak: GPT-like models achieve impressive in-distribution generalization but fail dramatically on ARC, which tests abstraction and open-ended reasoning.
Chollet’s Critique: Scaling laws improve interpolation, not extrapolation; current systems remain statistical parrots, not reasoning entities.
Stop relying on brute-force data and compute scaling as the primary strategy for progress toward AGI.
Scaling large language models (LLMs) produces diminishing returns for generalization beyond training distributions.
Memorization ≠ intelligence:
Memorization allows solving known problems, but cannot handle truly novel tasks.
The human brain achieves general intelligence with low energy consumption, modest compute, and a tiny fraction of the training data that LLMs require.
Evaluate algorithms for efficiency and abstraction, not raw benchmark scores.
Explicitly track resource-to-performance ratios: amount of data, compute, and energy per generalization improvement.
Promote research in architectures optimized for symbolic reasoning, representation learning, and meta-learning.
Current frontier models depend heavily on scaling:
GPT-4 → reportedly on the order of a trillion parameters, trained on terabytes of text data.
Costs millions of dollars in compute and energy.
Chollet’s View: Scaling-based progress cannot bridge the gap to AGI because it sidesteps abstraction and reasoning; it only inflates memorization capacity.
Design benchmarks that explicitly account for priors (innate knowledge) used by the system, ensuring fair comparisons across architectures and aligning with human cognition.
Chollet references Spelke’s Core Knowledge Theory: Humans are born with minimal priors like:
Objectness & basic physics
Agentness & goal-directedness
Geometry/topology awareness
Numerosity
Current AI systems embed massive implicit priors in their weights (from huge datasets), making them look more intelligent than they are.
True intelligence = efficient reasoning on minimal priors.
ARC: All tasks rely only on core priors, avoiding language, cultural knowledge, or dataset biases.
Track explicit priors given to AI systems in benchmarks.
Compare performance normalized for prior knowledge load.
LLMs leverage massive implicit priors learned from billions of documents → unfair advantage in narrow tasks, yet still fail at minimal-prior tests like ARC.
Chollet calls for benchmarks that penalize hidden prior overloading and reward reasoning from scratch.
Develop benchmarks and frameworks that simulate open-ended problem spaces, forcing AI systems to tackle genuinely novel and diverse tasks that cannot be memorized or brute-forced.
Real-world intelligence thrives on unpredictability.
IQ tests and ARC tasks succeed because:
They prevent pre-computation of all possible answers.
They measure adaptability and abstraction.
Current benchmarks fail because they are static and can be solved by memorization.
Expand ARC-like competitions:
Dynamically generated tasks → impossible to anticipate or pre-train on.
Require reasoning, pattern discovery, and abstraction.
Introduce never-before-seen task generators in benchmarks for ongoing evaluation of adaptability.
Very poor performance:
ARC Leaderboards: Humans achieve >95%; best AI ~35–40%.
No existing system demonstrates robust open-ended adaptability.
Chollet’s Position: Without open-ended evaluation, AGI progress claims are misleading and over-optimistic.
Chollet insists that achieving AGI requires a foundational paradigm shift:
From benchmark chasing → generalization testing.
From brute-force scaling → efficiency-driven innovation.
From hidden prior exploitation → transparent prior normalization.
From static datasets → open-ended challenges.
Bottom line: Stop measuring “who memorizes better”; start measuring who learns faster, reasons deeper, and adapts more flexibly with minimal priors and resources.
Chollet repeatedly emphasizes that abstraction is the engine of intelligence. While current AI models excel at pattern recognition, they fundamentally fail at creating new abstractions. This gap explains why models like GPT-4 can mimic reasoning patterns in text but collapse when facing tasks requiring genuine conceptual synthesis (e.g., ARC puzzles).
To achieve AGI, systems must autonomously discover representations, form abstractions, and recombine them compositionally across domains.
Build systems capable of autonomously generating abstract rules or concepts from raw observations, without explicit human-coded templates or brute-force memorization.
Humans solve novel tasks by forming abstract rules from a few examples.
Abstraction allows transfer from specific experiences to a vast space of unknown scenarios.
Current LLMs rely on pattern interpolation; abstraction = extrapolation beyond training data.
ARC Benchmark: Requires discovering latent rules (e.g., symmetry, color grouping, shape completion) never seen before.
Measure speed & efficiency of abstraction from minimal examples.
Evaluate ability to verbalize or encode discovered rules.
Extremely weak: LLMs and vision models fail at ARC because they can’t autonomously hypothesize rules beyond their statistical priors.
Chollet: “Current systems cannot autonomously generate new abstractions—they can only remix what they have memorized.”
Enable systems to compose new ideas or solutions by combining simpler concepts already known, producing novel but structured outputs.
Chollet frames compositionality as key to scalability of intelligence:
“The ability to combine a small set of concepts into an unbounded number of new ideas is what gives human cognition its power.”
Current AI lacks flexible compositional generalization, leading to brittle performance outside training distributions.
Test via ARC tasks requiring multi-step transformations (e.g., “reflect + recolor” → requires combining two distinct rules).
Evaluate models in cross-domain reasoning (combine geometry + numerosity).
Weak: LLMs approximate compositionality in language but fail in symbolic reasoning or visual tasks that require explicit combination of operations.
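To make the reflect-plus-recolor idea above concrete, the sketch below runs a toy compositional search over a tiny set of grid primitives, in the spirit of program-synthesis approaches to ARC. The primitive set, the demonstrations, and the depth limit are illustrative assumptions, not part of ARC or of any particular solver.

```python
# Toy compositional rule search: enumerate short compositions of grid
# primitives and return the first (shortest) one that explains every
# demonstration. Primitives and demos are illustrative only.
from itertools import product
from typing import Callable, Dict, List

Grid = List[List[int]]

def reflect(grid: Grid) -> Grid:
    """Reverse each row (horizontal mirror)."""
    return [list(reversed(row)) for row in grid]

def recolor_1_to_2(grid: Grid) -> Grid:
    """Replace color 1 with color 2 everywhere."""
    return [[2 if c == 1 else c for c in row] for row in grid]

PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "reflect": reflect,
    "recolor_1_to_2": recolor_1_to_2,
}

def apply_program(program, grid: Grid) -> Grid:
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(demos, max_depth: int = 2):
    """Return the shortest primitive sequence consistent with every demo."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(apply_program(program, i) == o for i, o in demos):
                return program
    return None

demos = [([[1, 0]], [[0, 2]]),
         ([[0, 1, 1]], [[2, 2, 0]])]
print(search(demos))  # ('reflect', 'recolor_1_to_2')
```

Preferring the shortest composition that explains all demonstrations is exactly the kind of structured recombination this milestone asks for, in miniature.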
Equip AI systems to autonomously build efficient, interpretable internal representations that capture structure and enable reasoning.
Chollet emphasizes that current AI stores dense statistical correlations instead of abstract representations.
Representation learning is critical for knowledge reuse and efficient generalization across domains.
Evaluate internal state structures (are they modular, transferable?).
Use compression-based metrics: shorter description length = better abstraction.
Test transfer learning performance to entirely novel task distributions.
Poor: LLMs’ representations are highly entangled and opaque.
Chollet: “They have no explicit abstractions. They memorize statistical regularities but don’t distill them into conceptual structures.”
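A crude way to operationalize the compression intuition: serialize whatever rule or representation a system reports for a task and score it by compressed size, a minimum-description-length-style proxy. The two rule encodings below are hypothetical examples, not outputs of any real system.

```python
# MDL-flavoured proxy: shorter serialized rules that still explain the data
# indicate better abstraction than near-verbatim restatements of it.
import json
import zlib

def description_length(rule_description: dict) -> int:
    """Compressed size, in bytes, of a serialized rule description."""
    payload = json.dumps(rule_description, sort_keys=True).encode()
    return len(zlib.compress(payload))

verbose_rule = {"mapping": {str(i): str(i) for i in range(1000)}}  # near-memorization
compact_rule = {"program": ["reflect", "recolor_1_to_2"]}          # reusable abstraction

print(description_length(verbose_rule))  # large: the rule restates the data
print(description_length(compact_rule))  # small: the rule compresses it
```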
Design systems to reason across multiple abstraction layers, decomposing complex tasks into smaller steps and integrating sub-solutions.
Humans solve complex problems by creating hierarchies (e.g., planning: long-term goals → mid-level strategies → atomic actions).
Chollet notes that hierarchy gives combinatorial efficiency and adaptability.
Multi-step ARC puzzles: e.g., “Find largest shape → color swap → mirror transform.”
Evaluate explicit decomposition abilities:
Can the system articulate intermediate steps?
Does it optimize search over compositional space?
Limited: LLMs can mimic step-by-step reasoning when prompted (“chain of thought”), but they do not autonomously build hierarchical strategies.
ARC reveals: current models cannot break down problems without heavy hand-holding.
AGI demands abstraction as its foundation. Chollet insists that without autonomous abstraction formation, explicit compositionality, strong internal representation learning, and hierarchical reasoning, scaling will hit a wall. Current AI systems:
Fail to generate new rules (abstraction gap).
Struggle to recombine concepts adaptively (compositional gap).
Lack modular, reusable representations (representation gap).
Cannot autonomously plan in layered reasoning spaces (hierarchical gap).
Bottom line: To reach AGI, we must replace “brute-force pattern fitting” with structured, self-directed concept formation and combinatorial reasoning architectures.
Chollet explicitly defines intelligence as skill-acquisition efficiency, which inherently involves minimizing the cost of learning and problem-solving in terms of data, compute, energy, and memory.
Current large-scale AI systems achieve impressive results, but their approach—brute-force scaling—contradicts efficiency principles. GPT-4's massive training regime (trillions of tokens, megawatt-scale power consumption) is an example of what Chollet argues is a dead end for achieving AGI.
To reach AGI, research must pivot from bigger models → smarter algorithms, emphasizing architectures that learn fast, reason with little data, and use resources optimally.
Build systems that can learn robust abstractions and generalize to unseen tasks using minimal examples—as humans do.
Humans can learn a new skill from a handful of demonstrations, sometimes from a single exposure.
Current AIs need billions of examples for narrow tasks, which is antithetical to intelligence.
Chollet calls this problem “buying intelligence with data,” which creates brittle and non-generalizable systems.
ARC: Each task provides 3–5 demonstrations only.
Few-shot and zero-shot learning benchmarks:
Measure performance per example rather than absolute accuracy.
Learning curves:
How fast does accuracy improve as examples increase?
Poor: LLMs appear to do few-shot learning but mostly rely on pattern recall from enormous datasets.
Chollet: “When you’ve seen everything, zero-shot performance is an illusion.”
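One way to make the performance-per-example idea measurable is a learning-curve harness: show a solver only the first k demonstrations of each task and record accuracy as a function of k. The `solve` callable and the task dictionaries below are assumed stand-ins, not an existing benchmark API.

```python
# Learning-curve harness sketch: accuracy as a function of the number of
# demonstrations the solver is allowed to see. Task format is illustrative.
from typing import Callable, List, Tuple

Demo = Tuple[list, list]  # (input grid, output grid)

def learning_curve(solve: Callable[[List[Demo], list], list],
                   tasks: List[dict],
                   max_demos: int = 5) -> List[float]:
    curve = []
    for k in range(1, max_demos + 1):
        correct = 0
        for task in tasks:
            prediction = solve(task["demos"][:k], task["test_input"])
            correct += prediction == task["test_output"]
        curve.append(correct / len(tasks))
    return curve

# Usage with a trivial "copy the input" baseline and a single toy task:
tasks = [{"demos": [([[1]], [[1]])], "test_input": [[3]], "test_output": [[3]]}]
print(learning_curve(lambda demos, x: x, tasks, max_demos=1))  # [1.0]
```

A steep curve (high accuracy after very few demonstrations) is what the definition of intelligence above rewards; a flat curve that only rises with massive data is what it penalizes.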
Enable models to reason and learn using minimal compute, rather than relying on massive parameter counts and training steps.
Efficiency is critical because brute-force compute scaling has diminishing returns and environmental costs.
The human brain runs general intelligence on roughly 20 W of power, not megawatt-scale clusters.
Computational efficiency ties back to algorithmic elegance—smarter architectures over bigger GPUs.
Normalize performance by FLOPs or runtime cost.
Test on ARC-like tasks under strict compute budgets.
Reward efficiency-oriented solutions in AGI benchmarks.
Extremely poor: Frontier models (GPT-4, Gemini) cost millions in compute; inference cost is high too.
Chollet: “We cannot scale our way to AGI; efficiency, not size, is the bottleneck.”
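As a sketch of budget-normalized scoring, the harness below counts a task as solved only if the answer is both correct and produced within a fixed per-task budget. Wall-clock time stands in crudely for FLOPs here; a real benchmark would meter compute or energy directly, and the task format is illustrative.

```python
# Budgeted scoring sketch: correctness only counts if it arrives within a
# fixed per-task compute budget (approximated by wall-clock time).
import time

def budgeted_score(solve, tasks, seconds_per_task: float = 1.0) -> float:
    solved = 0
    for task in tasks:
        start = time.perf_counter()
        prediction = solve(task["input"])
        elapsed = time.perf_counter() - start
        if elapsed <= seconds_per_task and prediction == task["output"]:
            solved += 1
    return solved / len(tasks)

tasks = [{"input": [[1, 2]], "output": [[2, 1]]}]
print(budgeted_score(lambda g: [list(reversed(r)) for r in g], tasks))  # 1.0
```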
Design AI systems that minimize energy per inference and per training epoch, approximating the energy efficiency of biological systems.
Energy scaling ≠ intelligence scaling.
Energy waste is a direct symptom of brute-force design, not smart algorithms.
Sustainability aside, physical constraints make energy-hungry AGI architectures non-viable at global scale.
Benchmark energy per task (joules per ARC puzzle solved).
Compare energy-to-performance ratios with human brain estimates.
Critical weakness: GPT-class models consume enormous energy for training and inference.
Chollet warns this is a dead end: systems must become thousands of times more energy-efficient to approach human-level intelligence.
Ensure systems store and retrieve knowledge compactly, modularly, and with minimal redundancy.
Memory optimization = efficient abstraction.
Humans store concepts and rules as compressed representations; current AI stores billions of weights encoding patterns redundantly.
Poor memory architecture → catastrophic forgetting or inefficiency.
Evaluate internal state complexity vs. performance (compression ratio).
Test for reusability of learned modules across tasks (transfer benchmarks).
Weak: Neural networks lack modular memory; knowledge is distributed across weights, making reuse and updates costly.
Chollet: “Opaque entangled representations break generalization.”
AGI cannot be brute-forced by throwing more compute, data, and energy at the problem.
Chollet insists progress depends on:
Data frugality → Learning from few examples.
Compute and energy efficiency → Algorithmic leaps, not bigger clusters.
Memory compactness → Modular, reusable internal representations.
Bottom line: AGI must be elegant—a system that does more with less, like the human brain.
Chollet argues that current deep learning models lack structured reasoning and operate almost entirely through pattern interpolation, which is insufficient for true generalization and abstraction.
He emphasizes that hybrid systems—combining symbolic reasoning with the representational power of neural networks—are essential for AGI. Why? Because human intelligence relies on:
Symbolic manipulation (rules, logic, hierarchical planning).
Perceptual learning (neural pattern recognition).
These complementary paradigms must be integrated for AGI to achieve robust abstraction, compositionality, and reasoning under uncertainty.
Develop architectures that integrate the statistical strength of neural networks with the structured, rule-based reasoning of symbolic systems.
Neural nets excel at perception but fail at systematic reasoning.
Symbolic systems excel at reasoning but fail at perception.
AGI requires both:
Neural layers for raw input → symbolic layers for compositional logic.
“You cannot brute-force search in an infinite combinatorial space; you need structured representations and symbolic abstractions.”
Hybrid models tested on ARC:
Perception handled by neural nets.
Rule inference handled by symbolic engines.
Benchmarks for abstraction depth and reasoning explainability.
Primitive: Some neuro-symbolic prototypes exist (e.g., pairing generalist neural models such as DeepMind's Gato with symbolic planners), but no large-scale hybrid has achieved strong ARC performance.
Chollet: “Deep learning alone will not solve AGI.”
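A minimal sketch of the division of labor such a hybrid implies: a perception stage turns a grid into symbolic facts, and a reasoning stage applies explicit rules over those facts. Both stages below are hand-written stand-ins; in a real hybrid the first would be a neural model and the second a logic or program-synthesis engine.

```python
# Neuro-symbolic pipeline sketch: grid -> symbolic facts -> rule application.
# Both stages are toy stand-ins for neural perception and symbolic reasoning.
from typing import Dict, List

Grid = List[List[int]]

def perceive(grid: Grid) -> List[Dict]:
    """'Perception' stage: one symbolic fact per non-zero cell."""
    return [{"color": c, "row": r, "col": k}
            for r, row in enumerate(grid)
            for k, c in enumerate(row) if c != 0]

def apply_rule(facts: List[Dict], grid_width: int) -> List[Dict]:
    """'Reasoning' stage: rule 'move every blue (1) object one cell right'."""
    return [{**f, "col": min(f["col"] + 1, grid_width - 1)} if f["color"] == 1 else f
            for f in facts]

facts = perceive([[1, 0, 2]])
print(apply_rule(facts, grid_width=3))
# [{'color': 1, 'row': 0, 'col': 1}, {'color': 2, 'row': 0, 'col': 2}]
```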
Enable systems to adapt their learning strategies by dynamically building symbolic abstractions about tasks and reasoning paths.
Chollet frames meta-learning as essential because:
Intelligence = improving at learning over time.
Symbolic meta-learning allows reflective reasoning about strategies, enabling higher-order adaptation.
Neural-only systems lack explicit meta-reasoning.
Evaluate systems’ ability to:
Generate new reasoning strategies without retraining.
Transfer symbolic learning across tasks.
Benchmarks:
“Meta-ARC” → meta-level reasoning about rule inference efficiency.
Weak: LLMs fake meta-learning via pattern recall, not true strategy invention.
Chollet: “No system today can autonomously create new reasoning strategies in unseen environments.”
Build architectures where reasoning and perception modules are separable and reusable across domains, enabling flexible recombination of learned skills.
Human cognition is modular:
Vision, language, planning, causal inference = loosely coupled.
Current AI = monolithic networks → brittle and costly to update.
Chollet: “Without modularity, every adaptation requires retraining the entire system.”
Evaluate transfer learning in ARC:
Can a module learned for color transformations be reused in shape tasks?
Assess composition speed and parameter isolation.
Poor: Today’s LLMs and vision models are monolithic; module reusability = near zero.
“The lack of modularity makes current AI incredibly inefficient and inflexible.”
Equip systems with explicit causal reasoning engines, moving beyond statistical correlation to genuine cause-effect understanding.
Chollet emphasizes causality as the key difference between statistical models and intelligent agents.
Without causal reasoning, AI cannot:
Predict effects of actions in novel environments.
Generalize knowledge structurally.
ARC puzzles often require causal inference (e.g., “if color = blue → move object”).
Benchmarks for causal abstraction:
Simulated environments with manipulable variables.
ARC variants where solutions require hypothetical reasoning (“What if I apply this rule?”).
Minimal: Current LLMs cannot model explicit causality.
Chollet: “Without causal inference, generalization is an illusion.”
AGI cannot emerge from monolithic pattern-matching models. Chollet prescribes structural intelligence built on:
Hybrid architectures (symbolic + neural).
Meta-learning capabilities for strategy-level adaptation.
Modularity for scalability and transferability.
Causal reasoning as a first-class citizen, not an afterthought.
Bottom line: Brains are structured, modular, causal; AGI systems must be too.
Chollet argues that intelligence is not just the ability to learn but the ability to improve learning itself—what he calls skill-acquisition efficiency. Humans excel because they learn how to learn: every new experience refines our meta-strategies, enabling faster, more general adaptation in the future.
Current AI systems lack this capability. They “learn” through static optimization on massive datasets, then freeze their parameters. Any update = costly retraining, not autonomous refinement. For AGI, AI must:
Monitor its own performance.
Detect weaknesses.
Improve strategies without human intervention.
Adapt continuously without catastrophic forgetting.
Enable systems to improve their own learning processes autonomously over time, using experience from diverse tasks to generalize faster in new ones.
Meta-learning allows:
Accumulating “learning priors” for accelerating skill acquisition.
Developing internal rules about rules (second-order reasoning).
Without meta-learning, adaptation speed remains flat → no cumulative intelligence.
Multi-task sequences (ARC variants): Measure if the system learns faster on task N than on task N-1.
Evaluate strategy generalization across domains:
E.g., after learning symmetry on one puzzle, apply it in unrelated contexts.
Superficial: LLMs mimic meta-learning via dataset coverage, not true self-improvement.
Chollet: “Our systems don’t improve themselves—they are improved by retraining.”
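One simple probe of this property: track how many examples a learner needs to reach criterion on each successive task from the same family and check whether that number falls. The `examples_needed_per_task` input below is a hypothetical measurement, not the output of an existing benchmark.

```python
# Meta-learning probe sketch: does the cost of learning shrink with experience?
def meta_learning_gain(examples_needed_per_task):
    """Average reduction in examples needed from one task to the next."""
    deltas = [earlier - later
              for earlier, later in zip(examples_needed_per_task,
                                        examples_needed_per_task[1:])]
    return sum(deltas) / len(deltas) if deltas else 0.0

# A system that learns to learn needs fewer examples on later tasks:
print(meta_learning_gain([5, 4, 3, 2]))  # 1.0 (positive = improving)
# A static system shows no gain:
print(meta_learning_gain([5, 5, 5, 5]))  # 0.0
```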
Develop AI that can evaluate its own knowledge gaps, reasoning errors, and confidence levels, enabling corrective action autonomously.
Humans engage in meta-cognition:
We ask: “Am I sure? Do I need more evidence? Did I fail?”
Current AI outputs answers without awareness of uncertainty or knowledge limits.
Chollet stresses introspection as critical for self-directed improvement and safety.
Benchmarks requiring uncertainty reporting and error diagnosis:
Can the system flag its low-confidence answers?
Meta-tasks: Detect failure and self-correct without external labeling.
Poor: LLM confidence scores correlate weakly with accuracy.
No autonomous pipeline for error-driven self-improvement.
Chollet: “LLMs cannot reason about their reasoning.”
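A minimal calibration probe, assuming a system reports a confidence in [0, 1] alongside each answer: accuracy among its high-confidence answers should clearly exceed accuracy among its low-confidence ones. The results list below is illustrative data, not measurements from any model.

```python
# Calibration probe sketch: confidence should track correctness.
def accuracy(answers):
    return sum(answers) / len(answers) if answers else float("nan")

def calibration_gap(results, threshold: float = 0.5) -> float:
    """Accuracy of high-confidence answers minus accuracy of low-confidence ones."""
    high = [r["correct"] for r in results if r["confidence"] >= threshold]
    low = [r["correct"] for r in results if r["confidence"] < threshold]
    return accuracy(high) - accuracy(low)

results = [
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.8, "correct": True},
    {"confidence": 0.2, "correct": False},
    {"confidence": 0.3, "correct": True},
]
print(calibration_gap(results))  # 1.0 - 0.5 = 0.5
```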
Systems must adapt incrementally to new tasks without catastrophic forgetting or full retraining, maintaining performance across old and new domains.
Continuous adaptation = survival in dynamic worlds.
Current AI suffers from:
Catastrophic forgetting: new learning erases old knowledge.
Static parameterization → no lifelong learning.
Sequential ARC tasks: Evaluate retention of old skills after solving new puzzles.
Lifelong learning benchmarks:
Performance trajectory across evolving distributions.
Weak: Continual learning is an active research field, but mainstream systems (LLMs) remain snapshot models with no true lifelong adaptability.
Chollet: “Static models = dead intelligence.”
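A sketch of how forgetting could be quantified in a sequential evaluation: re-test every earlier task after each new one is learned and report the average drop from each task's best accuracy to its final accuracy. The accuracy matrix below is illustrative, not measured.

```python
# Forgetting metric sketch for sequential (lifelong) evaluation.
def forgetting(accuracy_matrix) -> float:
    """accuracy_matrix[t][i] = accuracy on task i measured after learning task t.
    Returns the average drop from each task's best accuracy to its final accuracy."""
    final = accuracy_matrix[-1]
    drops = []
    for i in range(len(final)):
        best = max(row[i] for row in accuracy_matrix if i < len(row))
        drops.append(best - final[i])
    return sum(drops) / len(drops)

# Task 0 is learned well (0.9) but degrades to 0.4 after tasks 1 and 2 are learned:
matrix = [[0.9],
          [0.6, 0.8],
          [0.4, 0.7, 0.9]]
print(forgetting(matrix))  # ((0.9-0.4) + (0.8-0.7) + (0.9-0.9)) / 3 ≈ 0.2
```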
Equip systems with mechanisms to identify mistakes, infer causes, and generate self-corrections—without external re-labeling or retraining.
Humans learn from failure autonomously.
Current AI:
Often unaware of being wrong.
Requires curated feedback loops.
Chollet emphasizes self-repair as critical for scalable autonomy.
Design tasks where the initial solution is wrong but the system can revise its own reasoning iteratively.
Measure:
How often does the model recognize its error?
How fast does it recover?
Minimal: LLM “self-correction” (via prompting) = illusion; still driven by human instruction.
Chollet: “Current systems don’t learn from their own mistakes—they just output another guess.”
HGI demands systems that:
Learn how to learn (meta-learning).
Think about their thinking (introspection).
Adapt continuously without forgetting.
Self-correct without supervision.
Bottom line: True general intelligence is self-improving, not frozen at training time.
Chollet: “As long as learning is static, there is no intelligence—only a database of patterns.”
Chollet argues that intelligence is inherently active and purposive. Humans don’t just react—they set goals, plan, adapt strategies, and create novel solutions.
Current AI systems lack genuine agency: they execute externally specified tasks without self-generated objectives or adaptive decision-making. Chollet emphasizes that to reach AGI, systems must:
Define and reprioritize goals autonomously.
Handle trade-offs and uncertainty.
Demonstrate creativity beyond pattern recall.
Develop systems capable of generating, prioritizing, and modifying goals without explicit external commands, while remaining aligned with high-level objectives.
Intelligence = purposeful behavior.
Without internal goal formation:
AI remains a passive pattern generator.
True agency requires:
Anticipating future needs.
Dynamically creating subgoals.
Goal-discovery tasks:
Present open-ended environments (e.g., ARC variants, sandbox simulations).
Evaluate if the system identifies novel intermediate objectives autonomously.
Metrics:
Diversity and relevance of generated goals.
Adaptation speed when context changes.
Non-existent: GPT-class models do not set goals; they react to prompts.
Chollet: “Static models cannot have agency—they cannot want.”
Enable AI to reprioritize objectives dynamically, managing multiple goals under evolving constraints.
Humans constantly adjust plans:
New information = goal reprioritization.
Without flexible goal management:
AI fails in dynamic environments.
Sequential ARC tasks with conflicting objectives.
Multi-objective benchmarks requiring:
Trade-off reasoning.
Dynamic strategy switching.
Weak: Reinforcement learning agents manage a limited set of goals, but their goal handling is rigid and brittle.
LLMs: Zero autonomous prioritization capability.
Equip AI to produce genuinely novel, useful, and context-appropriate ideas, not just recombinations of memorized patterns.
Creativity = engine of open-ended generalization.
Chollet stresses:
Pattern interpolation ≠ creativity.
True creativity = abstraction-driven recombination + innovation.
ARC-based novelty tests:
Require solutions with patterns not seen in training.
Evaluate:
Originality (does it differ from memorized patterns?).
Functionality (is it effective and generalizable?).
Superficial: LLM “creativity” = probabilistic remixing of dataset patterns.
Chollet: “Creativity cannot emerge from memorization alone.”
Develop systems capable of balancing exploration and exploitation, reasoning under uncertainty, and evaluating trade-offs between risks and rewards.
Intelligence thrives in uncertain environments.
Risk management:
Requires forecasting consequences.
Demands causal and probabilistic reasoning—both weak in current AI.
Without it:
Systems fail in real-world complexity.
Tasks introducing stochastic outcomes:
Require safe yet exploratory strategies.
Metrics:
Performance stability under uncertainty.
Ability to self-calibrate risk levels.
Poor: LLMs lack mechanisms for explicit risk modeling.
Reinforcement learning agents approximate it, but fail in open-ended domains.
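For intuition, the classic exploration-versus-exploitation setting makes the trade-off concrete: an epsilon-greedy policy on a toy two-armed bandit explicitly spends a fraction of its actions probing uncertain options. This is standard reinforcement-learning machinery, offered only as a sketch of explicit risk handling; the reward probabilities and epsilon are illustrative.

```python
# Epsilon-greedy bandit sketch: balance exploiting the best-known option
# against exploring uncertain ones. All parameters are illustrative.
import random

def epsilon_greedy(reward_probs, steps: int = 1000, epsilon: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore
            arm = rng.randrange(len(reward_probs))
        else:                                            # exploit current estimate
            arm = max(range(len(reward_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
        total += reward
    return total / steps

print(epsilon_greedy([0.2, 0.8]))  # close to 0.8, minus the cost of exploration
```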
To achieve HGI, systems must go beyond passive pattern completion. They must:
Generate their own goals.
Reprioritize dynamically under uncertainty.
Innovate beyond memorization.
Evaluate risks intelligently.
Bottom line: True general intelligence = active, purposive, and creative.
Chollet: “Without agency, adaptation and creativity, intelligence is an illusion.”