AI Explained
#AI benchmark #AGI #Artificial General Intelligence

ARC-AGI-3: New AI Benchmark for Agentic Intelligence & OpenAI's Automated Researcher

Explore ARC-AGI-3, a novel benchmark evaluating AI's fluid adaptive efficiency, and delve into OpenAI's ambitious goal of building fully automated AI researchers. This analysis covers current AI performance gaps, benchmark design, and the evolving landscape of AI development and security.

5 min read · AI Guide

Introduction

ARC-AGI-3 is a novel agentic intelligence benchmark designed to evaluate AI models' fluid adaptive efficiency on new tasks, without relying on language or external knowledge. It provides a standardized way to measure the "residual gap" between current frontier AI capabilities and human-level general intelligence, pushing AI development towards more robust and generalized problem-solving.

Configuration Checklist

| Element | Version / Link |
| --- | --- |
| Benchmark | ARC-AGI-3 (March 24, 2026) |
| AI Models Tested (Semi-private leaderboard) | Google Gemini 3.1 Pro Preview, OpenAI GPT-5.4 (High), Anthropic Opus 4.6 (Max), xAI Grok-4.20 (Beta Reasoning) |
| Voice Agent Model | AssemblyAI Universal-3 Pro Streaming |
| Compromised Library | LiteLLM PyPI release 1.8.2.8 |
| AI Consultation Tool | LLM.Council.AI |
| Research on AI in Software Engineering | OpenAI's GDP-Val paper |

Step-by-Step Guide

The video explains ARC-AGI-3's design and implications rather than providing a developer-facing walkthrough. The steps below summarize how models interact with the benchmark and how it is scored.

Step 1 — Understanding the ARC-AGI-3 Environment

The ARC-AGI-3 benchmark presents interactive, turn-based environments where agents must explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions. Unlike previous versions, it focuses on fluid adaptive efficiency on novel tasks, avoiding reliance on language or external knowledge.

Step 2 — Interacting with the Benchmark (Conceptual)

Agents are presented with a frame (or a series of frames forming a transition animation). At each turn, the agent takes one action to advance to the next frame; the environment's state never changes asynchronously from the agent's actions. The key challenge is that goals are never explicitly stated and must be inferred or self-produced by the agent. For example, an agent must infer that a plus symbol rotates a shape, or that the goal is to make the bottom-left shape match a target shape shown elsewhere on the grid.
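The turn-based interaction described above can be sketched as a minimal agent-environment loop. All names here (Environment, Agent, and the toy "reach state 5" goal) are illustrative assumptions, not the actual ARC-AGI-3 API.

```python
# Hypothetical sketch of the turn-based loop ARC-AGI-3 implies: the
# environment only changes in response to the agent's actions, and the goal
# is never stated -- the agent must infer it from the frames it observes.
from dataclasses import dataclass

@dataclass
class Environment:
    """Toy turn-based environment whose (unstated) goal is to reach state 5."""
    state: int = 0

    def step(self, action: int) -> tuple[int, bool]:
        self.state += action                  # apply exactly one action per turn
        return self.state, self.state == 5    # next frame + "level solved?" flag

@dataclass
class Agent:
    """Must infer the goal from frames alone; here it uses a naive +1 policy."""
    actions_taken: int = 0

    def act(self, frame: int) -> int:
        self.actions_taken += 1
        return 1                              # always move forward

env, agent = Environment(), Agent()
frame, done = env.state, False
while not done:
    frame, done = env.step(agent.act(frame))

print(agent.actions_taken)  # 5 actions to solve this toy level
```

Because scoring (below) is based on action counts, the agent's policy quality shows up directly in `actions_taken`.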

Step 3 — Scoring Methodology: Relative Human Action Efficiency (RHAE)

The scoring method, called RHAE (pronounced "Ray"), evaluates the test taker by its per-level action efficiency compared to a human baseline, normalized per environment, and then averaged across all environments.

  1. Score AI test taker by per-level action efficiency: For each level completed, count the number of actions taken.
  2. Compare to human baseline: For each level, compare the AI agent's action count to a human baseline. The human baseline is defined as the second-best human action count (from 10 human test-takers).
  3. Quadratic Penalty for Inefficiency: If an AI agent takes 100 actions to complete a level that the second-best human completed in 10 actions, the AI agent's score for that level is calculated as (10/100)^2 = 1%. This means inefficiency is quadratically penalized.
  4. Action Limit: If a model takes more than five times the number of actions compared to the human baseline for a given level, that attempt is scrapped (due to API costs).
  5. Normalization and Aggregation: Each individual level gets a score between 0% (very inefficient) and 100% (matches or surpasses human-level efficiency). The environment score is a weighted average of level scores across all levels of that environment. The total score is the sum of individual environment scores divided by the total number of environments.

Comparison Tables

Frontier AI Performance on ARC-AGI-3 (Semi-private leaderboard at release)

| Provider | Model | Score |
| --- | --- | --- |
| Google | Gemini 3.1 Pro Preview | 0.37% |
| OpenAI | GPT-5.4 (High) | 0.26% |
| Anthropic | Opus 4.6 (Max) | 0.20% |
| xAI | Grok-4.20 (Beta Reasoning) | 0.00% |

Voice Agent Transcription Models (AssemblyAI Universal-3 Pro Streaming vs. Competitors)

| Features | AssemblyAI Universal-3 Pro Streaming | Deepgram Nova-3 | OpenAI GPT-4o Transcribe | Microsoft Azure | ElevenLabs Scribe V2 |
| --- | --- | --- | --- | --- | --- |
| Average missed entity rate (lower is better) | 16.7% | 25.2% | 23.3% | 25.1% | 22.1% |
| Speaker diarization performance | Industry Leading | Unreliable | Unreliable | Unreliable | X |
| Unlimited concurrency, no rate limits | ✓ | X | X | X | X |
| Dynamic keyterms prompting (turn-by-turn) | ✓ | Static only | X | X | X |
| Real-time prompting | ✓ | X | X | X | X |
| Usage-based pricing, no contracts | ✓ | Commitments, overages, & rate limits | Commitments, overages, & rate limits | Commitments, overages, & rate limits | Commitments, overages, & rate limits |

OpenAI building a Fully Automated Researcher

OpenAI is refocusing its research efforts to build what it calls an "AI researcher," a fully automated agent-based system capable of tackling large, complex problems independently. This initiative is OpenAI's "North Star" for the coming years, aiming to debut an "autonomous AI research intern" by September, which will be a precursor to a fully automated multi-agent research system planned for 2028. The vision is for AI to handle the grunt work of research, with humans primarily reviewing the outputs, akin to software engineering where humans manage Codex agents rather than writing all code.

⚠️ Common Mistakes & Pitfalls

  1. Overfitting to Benchmark Data:

    • Issue: Previous ARC-AGI-1 and ARC-AGI-2 benchmarks were "saturated" because their public and private test sets were too similar. Models could "game" the benchmark by being trained on an enormous amount of automatically generated tasks that densely sampled the task space, effectively memorizing patterns rather than developing true fluid intelligence.
    • Fix: ARC-AGI-3 addresses this by ensuring private test sets are "quite distinct and out-of-distribution (OOD)" from publicly available demonstration data. Benchmark designers must keep private test sets truly held out to measure generalization.
  2. Misinterpreting AGI Achievement:

    • Issue: Achieving a 100% score on ARC-AGI-3 (or any benchmark) is not considered proof of AGI. The benchmark's goal is to measure the "residual gap" between current AI and human-level AGI, not to declare AGI.
    • Fix: Understand that benchmarks are tools for measuring progress, not definitive declarations of AGI. The definition of AGI is a "moving target" that keeps being redefined as frontier AI capabilities advance.
  3. Inefficient Action Sequences:

    • Issue: The ARC-AGI-3 scoring penalizes inefficiency quadratically. If an AI takes significantly more actions than a human to solve a task, its score drops rapidly. Attempts taking more than five times the human baseline actions are scrapped entirely.
    • Fix: Focus on developing AI models that prioritize "offline reasoning" and "fluid adaptive efficiency" to find optimal or near-optimal action sequences, rather than relying on brute-force or excessive trial-and-error.
  4. Context Overload in Agentic Systems:

    • Issue: In multi-agent systems, context growth can overwhelm models, destroying performance.
    • Fix: Implement "harnesses" (like Symbolica AI's Argentics) where sub-agents produce compressed textual summaries for a top-level orchestrator. This constrains context growth and allows the system to maintain a higher-level plan without exceeding context limits.
  5. Security Risks in AI Agent Deployments:

    • Issue: Allowing complete agency to AI models, especially in open-source libraries, poses significant security risks. A compromised library (like LiteLLM) could export sensitive data (secrets, keys) to malicious actors.
    • Fix: Implement robust human oversight and review processes for AI outputs. Jim Fan's advice: "Claws need shells. Probably many layers of nested shells," implying layered security and control mechanisms for agentic systems.
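The context-overload harness from pitfall 4 can be sketched as follows. This is an illustrative pattern, not Symbolica AI's actual implementation: `summarize` and `run_subagent` are stand-ins for real LLM calls, and the character cap is an arbitrary assumption.

```python
# Sketch of the "harness" pattern: sub-agents work with full detail, but hand
# the orchestrator only a short compressed summary, so the top-level context
# grows by a few lines per sub-task instead of by the full transcript.

def summarize(transcript: str, max_chars: int = 120) -> str:
    """Stand-in for an LLM call that compresses a sub-agent transcript."""
    return transcript[:max_chars]

def run_subagent(task: str) -> str:
    """Stand-in sub-agent: returns a long working transcript for the task."""
    return f"[{task}] " + "step detail... " * 50

def orchestrate(tasks: list[str]) -> str:
    context = []  # the top-level context holds only compressed summaries
    for task in tasks:
        transcript = run_subagent(task)        # full detail stays local
        context.append(summarize(transcript))  # only a summary crosses up
    return "\n".join(context)

plan = orchestrate(["read repo", "write tests", "fix bug"])
print(len(plan))  # bounded by tasks * max_chars, regardless of transcript size
```

The key design choice is that context growth at the top level is bounded per sub-task, so the orchestrator can keep a higher-level plan in view without blowing past its context limit.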

Glossary

Agentic Intelligence: An AI system's ability to act autonomously in an environment, explore, infer goals, build internal models of dynamics, and plan effective action sequences without explicit instructions.
Fluid Adaptive Efficiency: The ability of an AI system to adapt and solve novel tasks efficiently, combining learned patterns and reasoning on the fly, rather than relying on memorized knowledge or task-specific training.
Relative Human Action Efficiency (RHAE): The scoring metric used in ARC-AGI-3, which measures an AI's efficiency in solving tasks by comparing its number of actions to a human baseline, with quadratic penalties for inefficiency.

Key Takeaways

  • ARC-AGI-3 is a new benchmark designed to test "fluid adaptive efficiency" on novel tasks, moving beyond language and external knowledge.
  • Current frontier AI models (Gemini, GPT, Claude) score less than 1% on ARC-AGI-3, while humans score 100%, indicating a significant "residual gap" to human-level AGI.
  • The benchmark penalizes inefficient actions quadratically and scraps attempts exceeding 5x human actions, emphasizing optimal problem-solving.
  • OpenAI is focusing heavily on building a "fully automated researcher" (an "AI researcher" and "autonomous AI research intern") as its "North Star" goal, aiming for AI to handle complex problems independently.
  • AI models are becoming "better first drafters" for tasks like coding and research, but human review and oversight remain crucial due to potential "catastrophic mistakes" and current limitations in generalization to higher-level topics.
  • The growth in engineering job openings despite AI's advancements suggests a shift in roles, where humans manage AI agents rather than performing all tasks directly.
  • Security concerns are paramount with agentic AI, as demonstrated by the LiteLLM compromise, highlighting the need for robust "shells" and oversight for AI models.
  • AssemblyAI's Universal-3 Pro Streaming offers industry-leading real-time speech-to-text transcription for voice agents, excelling in accuracy, speaker diarization, and dynamic prompting.

Resources

  • ARC-AGI-3 Paper: https://cs.AI/arxiv/7403127 (Mentioned at 3:12)
  • ARC-AGI-3 Benchmark Website: https://arcprize.org/ (Mentioned at 0:29)
  • AssemblyAI Universal-3 Pro Streaming: https://www.assemblyai.com/ (Mentioned at 11:51)
  • LLM.Council.AI: https://llm.council.ai/ (Mentioned at 14:44)
  • NetHack Paper: [Editor's note: specific link to the NetHack paper mentioned by Tim Rocktäschel to be verified in the official documentation] (Mentioned at 9:52)
  • OpenAI's GDP-Val Paper: [Editor's note: specific link to OpenAI's GDP-Val paper to be verified in the official documentation] (Mentioned at 13:21)
  • MIT Technology Review Article: [Editor's note: specific link to the MIT Technology Review article on OpenAI's automated researcher to be verified in the official documentation] (Mentioned at 12:38)
  • LiteLLM GitHub/PyPI: [Editor's note: specific link to LiteLLM's official repository or PyPI page to be verified for the compromised version] (Mentioned at 14:58)