- Researchers from Tencent’s WeChat AI team and Tsinghua University released CALM, a new language model architecture that replaces discrete token prediction with continuous vector prediction.
- The system compresses four tokens into a single continuous vector using a high-fidelity autoencoder, maintaining 99.9% reconstruction accuracy.
- This cuts the number of autoregressive generation steps by a factor of four, potentially quadrupling inference speed without sacrificing output quality.
Every large language model on the market shares the same fundamental constraint: they generate text one token at a time, sequentially, like a typewriter with a PhD. A new paper from researchers at Tencent’s WeChat AI division and Tsinghua University proposes to kill that constraint entirely—not by making token generation faster, but by making the token itself obsolete.
The paper introduces CALM (Continuous Autoregressive Language Models), a framework that replaces discrete “next-token prediction” with continuous “next-vector prediction.” Instead of predicting the next word from a fixed vocabulary of 32,000 options, CALM predicts a dense numerical vector that represents an entire chunk of meaning at once. The shift from tokens to vectors opens a new scaling dimension the authors call “semantic bandwidth”—the amount of information processed in a single generative step.
How It Works: CALM, the Language Model That Thinks in Vectors
CALM uses a two-stage architecture. First, an autoencoder learns to compress a chunk of K tokens into a single continuous vector—then reconstruct the original tokens from that vector. At K=4, the system maintains over 99.9% reconstruction accuracy, according to the paper. Second, a language model performs autoregressive prediction in this continuous vector space, predicting the next vector instead of the next token.
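To make the two stages concrete, here is a minimal shape-level sketch in NumPy. The weights are random placeholders rather than trained parameters, and the dimensions (K=4 matches the paper; D=128 and the toy linear maps are illustrative assumptions), so it demonstrates only the data flow: K token ids compress into one vector, the vector decodes back into K sets of token logits, and the language model predicts the next vector instead of the next token.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, K, D = 32_000, 4, 128   # vocabulary size, chunk length, vector dim

# Stage 1: an autoencoder over chunks of K tokens.
# Real CALM trains this to >99.9% reconstruction accuracy; these
# random weights only show the shapes and direction of data flow.
embed = rng.normal(size=(VOCAB, D))   # token embedding table
W_enc = rng.normal(size=(K * D, D))   # K embedded tokens -> one vector
W_dec = rng.normal(size=(D, K * D))   # one vector -> K embedded tokens

def encode(chunk_tokens):
    """Compress K token ids into a single continuous D-dim vector."""
    flat = embed[chunk_tokens].reshape(-1)   # (K*D,)
    return flat @ W_enc                      # (D,)

def decode(vec):
    """Recover per-token logits for the K tokens from one vector."""
    per_tok = (vec @ W_dec).reshape(K, D)    # (K, D)
    return per_tok @ embed.T                 # (K, VOCAB) logits

# Stage 2: the language model is autoregressive over *vectors*.
W_lm = rng.normal(size=(D, D))

def predict_next_vector(prev_vec):
    """One generative step: previous chunk vector -> next chunk vector."""
    return np.tanh(prev_vec @ W_lm)

chunk = np.array([17, 250, 4096, 31999])     # four token ids
v = encode(chunk)
logits = decode(v)
print(v.shape, logits.shape)                 # (128,) (4, 32000)
```

One vector round-trips four tokens, which is exactly the property the trained autoencoder is optimized to preserve.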
The practical upshot is straightforward: if you compress four tokens into one vector and predict that vector, you cut your generative steps by a factor of four. The model doesn’t generate faster in the traditional sense—it just generates more per step. The code is available on GitHub, and the team reports that CALM “significantly improves the performance-compute trade-off” compared to standard discrete baselines.
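The step-count arithmetic is easy to verify: if each step emits one vector covering K tokens, a completion of N tokens needs ceil(N/K) generative steps instead of N. A trivial sketch:

```python
import math

def generation_steps(n_tokens: int, k: int) -> int:
    """Autoregressive steps needed to emit n_tokens when each
    step produces one vector that decodes to k tokens."""
    return math.ceil(n_tokens / k)

# A 1,000-token completion: 1,000 steps token-by-token vs. 250 at K=4.
print(generation_steps(1000, 1))   # 1000
print(generation_steps(1000, 4))   # 250
```

The per-step compute is not identical between the two regimes, which is why the paper frames the gain as a better performance-compute trade-off rather than a raw 4x speedup.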
The trade-off is that continuous-domain modeling requires a completely different toolkit. Standard maximum-likelihood training depends on a softmax over a discrete vocabulary; a continuous vector output provides no such tractable probability distribution to maximize. So the Tencent-Tsinghua team developed what they call a "likelihood-free framework" that includes energy-based training for generative modeling, a new evaluation metric called BrierLM, and a temperature sampling method adapted for the continuous domain.
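BrierLM's exact construction is in the paper; for background, the classic Brier score it is named after rates a forecast by the mean squared error between the predicted probability vector and the one-hot actual outcome, a quantity that can be estimated from samples rather than explicit likelihoods. A minimal illustration:

```python
import numpy as np

def brier_score(probs, outcome_index):
    """Classic Brier score: mean squared error between predicted
    probabilities and the one-hot outcome. Lower is better; a
    perfect, fully confident prediction scores 0."""
    target = np.zeros_like(probs)
    target[outcome_index] = 1.0
    return float(np.mean((probs - target) ** 2))

# A confident correct forecast scores near 0; uniform guessing scores worse.
confident = np.array([0.97, 0.01, 0.01, 0.01])
uniform = np.full(4, 0.25)
print(round(brier_score(confident, 0), 4))   # 0.0003
print(round(brier_score(uniform, 0), 4))     # 0.1875
```

How BrierLM adapts this idea to language modeling is a detail of the paper itself, not of this sketch.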
The timing matters. The AI infrastructure race is already costing tens of billions in compute spend, and inference efficiency is the single biggest lever for bringing those costs down. If CALM’s approach holds at scale, it wouldn’t just make models faster—it would reshape the economics of running them.
Whether this approach scales to production-grade models remains an open question—the paper tests CALM on models trained from scratch, not on retrofitting existing giants like GPT-4 or Claude. But the argument is compelling: we’ve spent years widening the road (larger vocabularies, better tokenizers, speculative decoding) when the real bottleneck is the road itself. The push for on-device, efficient AI only makes a breakthrough like this more urgent. CALM doesn’t widen the road. It builds a highway next to it.

