The race to improve AI has largely focused on making models bigger: more parameters, more data, more compute. Yet a new study from researchers at UVA suggests we may need to rethink how training and inference work at a foundational level. Through a novel architecture called Energy-Based Transformers, which fundamentally rethinks how AI systems approach reasoning, they achieved performance gains that outpace what iterating on existing model architectures alone can deliver. Their work demonstrates that teaching models to think differently, through iterative refinement rather than single-pass prediction and by recasting generation as a minimisation problem, can unlock capabilities that brute-force scaling cannot reach.
Energy-based models (EBMs) and transformers represent two distinct approaches in machine learning that have evolved along separate trajectories. Energy-based models assign a scalar energy value to each possible outcome, where lower energy indicates higher compatibility between an input and its prediction. Energy-based modelling can be applied to any task that involves comparing two elements; the nuance lies in how that comparison is defined. Transformers, meanwhile, have emerged as the dominant architecture for large-scale AI systems, valued for their stability and scalability when processing sequences. Energy-Based Transformers (EBTs) unite these two paradigms by using the transformer architecture to implement the energy function. Rather than directly outputting predictions, EBTs refine their outputs through energy minimisation: each prediction undergoes its own optimisation process, effectively introducing a form of deliberate, step-by-step reasoning into the model's operation.
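To make the idea concrete, the sketch below shows what energy-minimisation inference could look like in PyTorch. It assumes a hypothetical `energy_model(context, candidate)` that returns a scalar energy from a transformer backbone; the function name, step count and step size are illustrative, not the authors' implementation.

```python
import torch

def refine_prediction(energy_model, context, init_candidate,
                      steps: int = 8, step_size: float = 0.1):
    """Refine a candidate prediction by gradient descent on the scalar
    energy E(context, candidate); lower energy = more compatible."""
    candidate = init_candidate.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_model(context, candidate)  # scalar (or per-example) energy
        grad, = torch.autograd.grad(energy.sum(), candidate)
        # Step the prediction itself, not the model weights, towards lower energy.
        candidate = (candidate - step_size * grad).detach().requires_grad_(True)
    final_energy = energy_model(context, candidate).detach()
    return candidate.detach(), final_energy
```

The key design choice is that the optimisation loop runs over the prediction, while the learned energy function stays fixed, which is what allows extra inference-time compute to translate into better outputs.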
In the paper, the authors show that this architectural change yields substantial improvements in both learning efficiency and performance. During pretraining, EBTs demonstrated scaling rates up to 35 percent higher than the standard Transformer++ recipe across multiple dimensions, including data usage, batch size, parameter count and model depth. This superior efficiency means EBTs extract more learning from the same computational resources. The benefits extend beyond training to inference time: when given additional computation, EBTs improved their predictions in ways that conventional transformers cannot match. On language modelling tasks, extended computation delivered 29 percent greater performance gains than for Transformer++. In image denoising, EBTs achieved better results than diffusion transformers whilst requiring 99 percent fewer forward passes.
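In terms of the sketch above, "thinking longer" simply means spending more refinement steps on a given input, with no change to the trained weights; the step counts below are purely illustrative.

```python
# Hypothetical usage: same model, more inference-time compute on a hard input.
quick_answer, _   = refine_prediction(energy_model, context, init, steps=2)
careful_answer, _ = refine_prediction(energy_model, context, init, steps=32)
```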
Perhaps most significantly, EBTs excel at handling unfamiliar data. The research demonstrates that as data becomes increasingly out-of-distribution, the benefits of EBTs' inference-time optimisation grow accordingly. Even when starting with similar or slightly worse pretraining performance, EBTs consistently outperformed traditional models on reasoning and question-answering benchmarks. This advantage stems from their fundamental design as verifiers rather than pure generators. By learning to assess the compatibility between predictions and context, EBTs develop more robust representations that transfer better to new domains. The architecture also naturally captures uncertainty and enables prediction verification, bringing machine learning systems closer to the deliberate reasoning processes that characterise human problem-solving.
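Because the energy function scores how well a candidate fits its context, the same model can also act as a verifier. A minimal sketch, again with hypothetical names, might rank several candidate answers by their energy:

```python
import torch

def rank_candidates(energy_model, context, candidates):
    """Return (candidate, energy) pairs sorted by ascending energy,
    so the most context-compatible prediction comes first."""
    with torch.no_grad():
        scored = [(c, energy_model(context, c).item()) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1])
```

The spread of energies across candidates also gives a rough, built-in signal of uncertainty: a flat ranking suggests the model cannot distinguish good answers from bad ones.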
These findings suggest a potentially pivotal moment for AI architecture development. If the observed scaling advantages persist at foundation model scale, EBTs could fundamentally reshape how we build large language models and other AI systems. The work demonstrates that moving beyond incremental refinements of existing transformer designs can yield transformative improvements. By reconceptualising how models make predictions, through energy minimisation rather than direct generation, EBTs could offer a path towards systems that learn more efficiently and reason more effectively. This architectural innovation underscores a crucial principle for advancing AI capabilities: progress requires not just scaling existing approaches but exploring fundamentally new ways for models to process information and make decisions.
Resources: https://energy-based-transformers.github.io