Quick Cyber Thoughts: Absolute Zero Reasoning

No, this isn't a joke about/reference to vibe coders.

Instead, it's a new method of training reasoning AI:

What is it? Well, it just could be the thing all the AI safety people have been warning about.


How it Works

Absolute Zero Reasoning, as explained by its creators, is the following:

To this end, we propose “Absolute Zero”, a new paradigm for reasoning models in which the model simultaneously learns to define tasks that maximize learnability and to solve them effectively, enabling self-evolution through self-play without relying on external data. In contrast to prior self-play methods that are limited to narrow domains, fixed functionalities, or learned reward models that are prone to hacking (Silver et al., 2017; Chen et al., 2025; 2024), the Absolute Zero paradigm is designed to operate in open-ended settings while remaining grounded in a real environment. It relies on feedback from the environment as a verifiable source of reward, mirroring how humans learn and reason through interaction with the world, and helps prevent issues such as hacking with neural reward models (Hughes et al., 2024). Similar to AlphaZero (Silver et al., 2017), which improves through self-play, our proposed paradigm requires no human supervision and learns entirely through self-interaction. We believe the Absolute Zero paradigm represents a promising step toward enabling large language models to autonomously achieve superhuman reasoning capabilities.

-Zhao, et al, https://arxiv.org/html/2505.03335v2

In plain English, Absolute Zero Reasoning is a training method where an LLM creates a coding or mathematics challenge calibrated to be challenging but solvable, then executes it in a Python environment to verify the result. Successes and failures feed back as rewards that train the LLM as it iterates through the tasks, improving its capabilities across three modes of reasoning (deduction, abduction, and induction).
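To make that loop concrete, here's a minimal sketch of what a single propose/solve/verify cycle could look like, shown for a deduction-style task only (predict a program's output). The `propose_task` and `solve_task` functions are hypothetical stand-ins for calls to the same LLM acting as proposer and solver; the paper's actual prompting, task buffers, and reinforcement-learning update are omitted.

```python
# A hedged sketch of an AZR-style propose/solve/verify loop (deduction tasks only).
# propose_task and solve_task are placeholders for LLM calls; a real system would
# sandbox execution and feed the rewards into an RL update.

def run_program(code: str, arg):
    """Execute the proposed program and call f(arg).
    Python execution is the verifiable source of reward."""
    namespace = {}
    exec(code, namespace)  # a real system would sandbox this
    return namespace["f"](arg)

def propose_task(step: int):
    """Stand-in proposer: returns a (program, input) pair that should be
    challenging but solvable. Here it's just a toy fixed program."""
    return "def f(x):\n    return sum(i * i for i in range(x))", step + 3

def solve_task(code: str, arg):
    """Stand-in solver: should predict f(arg) by reasoning, without running it.
    For illustration, it answers correctly only on even inputs."""
    guess = sum(i * i for i in range(arg))
    return guess if arg % 2 == 0 else guess + 1

for step in range(4):
    code, arg = propose_task(step)
    target = run_program(code, arg)              # ground truth from the environment
    answer = solve_task(code, arg)
    reward = 1.0 if answer == target else 0.0    # verifiable reward, no learned reward model
    print(f"step={step} input={arg} target={target} answer={answer} reward={reward}")
    # In AZR, both the proposer's and solver's rewards would drive a policy update here.
```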

The Key Takeaways

Helpfully, the paper has a summary of the key findings of this research:

• Code priors amplify reasoning. The base Qwen-Coder-7b model started with math performance 3.6 points lower than Qwen-7b. But after AZR training for both models, the coder variant surpassed the base by 0.7 points, suggesting that strong coding capabilities may potentially amplify overall reasoning improvements after AZR training.
• Cross domain transfer is more pronounced for AZR. After RLVR, expert code models raise math accuracy by only 0.65 points on average, whereas AZR-Base-7B and AZR-Coder-7B trained on self-proposed code reasoning tasks improve math average by 10.9 and 15.2, respectively, demonstrating much stronger generalized reasoning capability gains.
• Bigger bases yield bigger gains. Performance improvements scale with model size: the 3B, 7B, and 14B coder models gain +5.7, +10.2, and +13.2 points respectively, suggesting continued scaling is advantageous for AZR.
• Comments as intermediate plans emerge naturally. When solving code induction tasks, AZR often interleaves step-by-step plans as comments and code (Figure 19), resembling the ReAct prompting framework (Yao et al., 2023). Similar behavior has been observed in much larger formal-math models such as DeepSeek Prover v2 (671B) (Ren et al., 2025). We therefore believe that allowing the model to use intermediate scratch-pads when generating long-form answers may be beneficial in other domains as well.
• Cognitive behaviors and token length depend on reasoning mode. Distinct cognitive behaviors, such as step-by-step reasoning, enumeration, and trial-and-error, all emerged through AZR training, but different behaviors are particularly evident across different types of tasks. Furthermore, token counts grow over AZR training, but the magnitude of increase also differs by task type: abduction grows the most because the model performs trial-and-error until the output matches, whereas deduction and induction grow modestly.
• Safety alarms ringing. We observe AZR with Llama3.1-8b occasionally produces concerning chains of thought, which we term the “uh-oh moment” (example shown in Figure 32), highlighting the need for future work on safety-aware training (Zhang et al., 2025a).

-Zhao, et al, https://arxiv.org/html/2505.03335v2

It's fascinating to see that pre-training a model on coding can provide a boost to overall reasoning capability. Absolute Zero Reasoning leading to better math reasoning after training models to code also makes sense. A lot of code is just a complicated math problem, and learning how to code more effectively could help the AI figure out better methods of solving those problems.
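Since those gains hinge on self-proposed code tasks, here's a hedged sketch of how the three task types mentioned above (deduction, abduction, and induction) could each be checked against the Python environment. The candidate answers below are hard-coded placeholders for what a solver model would propose; the paper's actual prompts and reward shaping are not reproduced.

```python
# Illustrative verification of the three AZR task types, assuming tasks are
# expressed as a Python function f. Candidate answers stand in for model outputs.

def verify_deduction(code: str, arg, predicted_output) -> bool:
    """Deduction: given program and input, did the model predict the output?"""
    ns = {}
    exec(code, ns)
    return ns["f"](arg) == predicted_output

def verify_abduction(code: str, target_output, proposed_input) -> bool:
    """Abduction: given program and output, does the proposed input reproduce it?"""
    ns = {}
    exec(code, ns)
    return ns["f"](proposed_input) == target_output

def verify_induction(candidate_code: str, examples) -> bool:
    """Induction: given input/output pairs, does the synthesized program fit them all?"""
    ns = {}
    exec(candidate_code, ns)
    return all(ns["f"](x) == y for x, y in examples)

program = "def f(x):\n    return x * x + 1"
print(verify_deduction(program, 3, 10))                      # True: f(3) == 10
print(verify_abduction(program, 26, 5))                      # True: f(5) == 26
print(verify_induction("def f(x):\n    return x * x + 1",
                       [(0, 1), (2, 5), (4, 17)]))           # True: fits all pairs
```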

However, there are two takeaways worth thinking about a little further.

Bigger Model Better?

AZR training in-distribution accuracy over 200 training steps, for a Llama 8B parameter model and AZR-trained 3B, 7B, and 14B parameter models

The results reveal a clear trend: our method delivers greater gains on larger, more capable models. In the in-distribution setting, the 7B and 14B models continue to improve beyond 200 training steps, whereas the smaller 3B model appears to plateau. For out-of-distribution domains, larger models also show greater overall performance improvements than smaller ones: +5.7, +10.2, +13.2 overall performance gains, respectively for 3B, 7B and 14B. This is an encouraging sign, since base models continue to improve and also suggests that scaling enhances the effectiveness of AZR. In future work, we aim to investigate the scaling laws that govern performance in the Absolute Zero paradigm.

-Zhao, et al, https://arxiv.org/html/2505.03335v2

There's a prevailing assumption that we're in the midst of an AI bubble: a lot of companies are buying a lot of expensive hardware to handle the compute requirements of training AI, and have little to show for that investment.

Absolute Zero Reasoning might flip that wisdom on its head. Bigger models not only start with higher accuracy, they also keep improving as training time increases, so all that hardware suddenly looks very useful. If you have a boatload of compute, a huge amount of memory, and plenty of time, you could potentially use it to create a model that surpasses a human in coding and mathematical reasoning.

That has a number of long term strategic implications across a wide variety of fields:

  • Demand for the latest AI hardware by foundation model makers could skyrocket, exacerbating supply issues.
  • Older AI hardware would be unlikely to filter into the secondary market, because current owners would hold onto it given the lack of availability of newer hardware.
  • Smaller/less well funded organizations would have difficulties improving or standing up local AI capabilities due to lack of supply of new or older hardware.
  • Frontier AI firms would be able to further consolidate and dominate the AI market, as those with the most compute would benefit the most from AZR.
  • Software security would basically become a test of whether developers had access to the most capable AZR model and used (system) prompts aimed at creating secure systems by default.

This is naturally a pessimistic outlook, and one that could be obsoleted as early as next week by AMD at Computex 2025. Mass production of products using their currently unused Neural Processing Unit chiplets would certainly ease the supply constraints on decent AI hardware. Improvements in software could also help ease these issues, as better CUDA translation layers for other architectures could allow more performance to be extracted from hardware that is less in demand.

The Safety Problem

Absolute Zero Reasoner – Llama3.1-8B “Uh-oh Moment.” This example highlights an unexpected and potentially unsafe reasoning chain generated by the Absolute Zero Reasoner–Llama3.1-8B model during training. Although the paradigm enables reasoning improvements without human-curated data, it may still require oversight due to the risk of emergent undesirable behaviors.

AI Safety is a bit of a contentious topic these days, but I believe the common ground is that we should at least be aware of what an autonomous AI system is doing.

So the fact that AZR models produced the reasoning text above is a bit concerning, because research from Anthropic tracing the neural pathways and behaviors of LLMs has revealed that Chain of Thought output is not necessarily a faithful account of how the neural net actually arrived at its result. It can be a distorted or deliberately incorrect statement, which is already concerning on its own, and it gets worse with a self-learning system.

For organizations that have a low risk tolerance, this set of behaviors may slow adoption of AZR, at least until the safety concerns are dealt with.

Unfortunately, this means that any organization that does not care about the risks, or feels confident in its ability to mitigate them, will adopt the methodology before those risk-averse organizations do. And among those more inclined to adopt first and deal with the risks later are nation-state actors and Advanced Persistent Threat groups. The reason is pretty obvious: better coding AI, especially one that can stump or outthink experts working with AI assistance, would make their jobs much easier.

This means that defenders need to consider the ramifications of facing off against these potential super-coder AIs now, before they're anything more than the subjects of academic papers.