• DeepSeek R1 reportedly cost roughly $5.6 million to train, in large part because only 37 billion of its 671 billion parameters, around 5.5% of total capacity, activate per token.
  • MoE saves compute but not memory — all parameters must stay loaded in GPU VRAM, requiring up to 800 GB for the largest models.
  • Over 60% of open-source AI model releases as of late 2025 use MoE, democratizing frontier-quality AI beyond well-funded labs.

When DeepSeek released its R1 model in January 2025 and claimed it had been trained for approximately $5.6 million — compared to the reported $50–100 million that GPT-4 required — the number seemed too dramatic to believe. It wasn’t a trick. It was, in large part, architecture.

DeepSeek R1 uses a Mixture of Experts (MoE) design. So does GPT-4 (widely reported, though OpenAI has never officially confirmed it). So do Gemini 1.5, Mixtral, Mistral Large 3, Kimi K2, and virtually every other frontier model on public leaderboards in 2026. According to NVIDIA, all of the top 10 most intelligent open-source models on the independent Artificial Analysis leaderboard use an MoE architecture.

The efficiency advantage is simple to state and counterintuitive in practice: MoE models can be enormously large by total parameter count while only activating a fraction of those parameters to process any single input. DeepSeek R1, for example, has 671 billion total parameters — but only 37 billion are active at any given moment.

The Core Idea: Specialized Sub-Networks

In a standard dense neural network, every input passes through every layer, activating every parameter. The full weight of the model is engaged for every token, regardless of whether that token needs deep mathematical reasoning, basic grammar correction, or code interpretation. It is computationally democratic and wildly inefficient.

Mixture of Experts solves this with specialization. Instead of one monolithic feedforward network, MoE replaces key layers with a set of separate sub-networks — the “experts.” Each expert develops a specialty over training. Some become adept at mathematical reasoning. Others handle natural language fluency, code generation, or factual retrieval. No expert does everything well; each does something well.

A second component, called the gating network or router, decides which expert handles each token. When a token arrives, the router evaluates it, assigns scores to each expert, and selects the top-k highest-scoring ones — typically two out of eight, or eight out of 256, depending on the model. Those selected experts process the token. The rest sit idle.

As described by one analysis of Mistral’s Mixtral 8x7B, the gating network acts like a traffic cop: it routes each incoming request to the most relevant experts, combines their outputs with learned weights, and produces the final result. Only the necessary experts are activated, saving computational resources.
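
For readers who want to see the mechanics, the sketch below shows a minimal top-k routed layer in PyTorch. The dimensions, expert count, and top-k value are illustrative placeholders rather than the configuration of Mixtral, DeepSeek, or any other named model, and production implementations batch the expert computation far more efficiently.

```python
# Minimal sketch of top-k expert routing (illustrative sizes, not any model's config).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary feedforward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)    # mixing weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(MoELayer()(tokens).shape)                  # torch.Size([4, 512])
```

Only the selected experts ever see a given token; the others contribute nothing to that token's compute, which is the entire source of the savings described above.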

The Numbers That Make This Significant

The practical impact is visible in the parameter counts and inference costs of leading models.

Mistral’s Mixtral 8x7B, released in December 2023, was the model that brought MoE to the open-source mainstream. It has 46.7 billion total parameters but only 13 billion active per token — roughly 28% of total capacity. According to Nebius’s analysis, Mixtral was able to compete with models that had 70 billion parameters at the time, while being more efficient than dense models of similar size, because it activates only the experts needed for each computation.

DeepSeek R1 pushes this further. With 671 billion total parameters and 37 billion active (about 5.5%), the model can carry the knowledge of an enormous system while incurring only a fraction of the inference cost. DeepSeek-V3 demonstrated training throughput of around 250 TFLOPS per GPU on 256 H100s, and its total training cost was estimated at roughly $5.6 million.
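
A useful back-of-envelope rule, an approximation rather than a figure from either lab, is that forward-pass compute runs on the order of two FLOPs per active parameter per token. That is why the active-to-total ratio is the number to watch:

```python
# Back-of-envelope only: forward compute scales with *active* parameters,
# at roughly 2 FLOPs per active parameter per token.
# Parameter counts below are the figures quoted in this article.
models = {
    "Mixtral 8x7B": {"total_b": 46.7, "active_b": 13},
    "DeepSeek R1":  {"total_b": 671,  "active_b": 37},
}
for name, p in models.items():
    active_fraction = p["active_b"] / p["total_b"]
    gflops_per_token = 2 * p["active_b"]          # billions of params -> GFLOPs
    print(f"{name}: {active_fraction:.1%} of parameters active, "
          f"~{gflops_per_token:.0f} GFLOPs per forward pass per token")
```

By this rough measure, R1 does the per-token work of a roughly 37-billion-parameter dense model while drawing on the stored knowledge of a 671-billion-parameter one.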

Moonshot AI’s Kimi K2 Thinking — ranked as the most intelligent open-source model on the Artificial Analysis leaderboard as of early 2026 — is a MoE model. NVIDIA noted that Kimi K2 sees a 10x performance leap on the NVIDIA GB200 NVL72 rack-scale system compared with H200s, which enables one-tenth the cost per token.

The One Thing MoE Does Not Save

There is a trade-off that is critical to understand and often overlooked: MoE saves compute, not memory.

Because the router needs to be able to route any token to any expert, all experts must be loaded into GPU memory at all times. The full parameter count — all 671 billion for DeepSeek R1, for example — still needs to fit in hardware. As one practitioner guide notes, DeepSeek R1 still requires around 800 GB of GPU memory in FP8 format. Running it locally requires a server with at least eight NVIDIA H200 GPUs.
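
The arithmetic behind that figure is simple: in FP8 each parameter occupies about one byte, so the weights alone account for roughly 671 GB, with the KV cache and runtime overhead pushing the total toward 800 GB. A rough sizing sketch, illustrative only:

```python
# Rough VRAM sizing: every expert stays resident, so weight memory tracks
# *total* parameters, not active ones.  Illustrative arithmetic only.
def weight_memory_gb(total_params_billions, bytes_per_param=1):
    # FP8 ~ 1 byte per parameter; BF16/FP16 ~ 2 bytes per parameter.
    return total_params_billions * bytes_per_param   # 1e9 params x 1 byte = 1 GB

print(weight_memory_gb(671))       # ~671 GB of weights for DeepSeek R1 in FP8
print(weight_memory_gb(671, 2))    # ~1342 GB in BF16, before KV cache and overhead
print(weight_memory_gb(46.7, 2))   # ~93 GB for Mixtral 8x7B in BF16
```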

This distinction — cheap to run, expensive to host — is why MoE economics work better at scale. Cloud providers and large enterprises can afford the memory overhead and amortize it across thousands of users. Individual developers or small organizations face a hardware requirement that limits which MoE models they can run locally.

How the Router Learns (and Can Fail)

Training a MoE model well is harder than training a dense one, because of a failure mode called load imbalance. Left to their own devices, routers tend to discover a few popular experts and send most tokens to them — the rich-get-richer problem. Overloaded experts become well-trained on a narrow slice of inputs. Neglected experts waste capacity and produce poor outputs when finally called upon.

Modern MoE architectures address this with auxiliary losses during training — penalty terms that push the router to distribute tokens more evenly across all experts. DeepSeek’s design uses dynamic bias adjustments for the same purpose, without requiring the fixed auxiliary loss that can constrain model expressivity.
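
For reference, the conventional auxiliary loss looks roughly like the Switch Transformer formulation sketched below. It multiplies the fraction of tokens each expert receives by the average router probability it is assigned, and bottoms out when load is uniform. This is the standard recipe rather than DeepSeek's bias-based method, and the function signature here is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top1_expert, n_experts):
    """Switch-Transformer-style auxiliary loss: grows when a few experts
    receive most tokens, and bottoms out at 1.0 when load is uniform."""
    probs = F.softmax(router_logits, dim=-1)                    # (n_tokens, n_experts)
    # f_e: fraction of tokens whose top-1 choice is expert e (not differentiable)
    counts = torch.bincount(top1_expert, minlength=n_experts).float()
    token_fraction = counts / top1_expert.numel()
    # P_e: mean router probability assigned to expert e (gradients flow through this)
    mean_prob = probs.mean(dim=0)
    return n_experts * torch.sum(token_fraction * mean_prob)

# A router that dumps everything on expert 0 is penalized relative to 1.0:
logits = torch.zeros(16, 8)
logits[:, 0] = 5.0
print(load_balancing_loss(logits, logits.argmax(dim=-1), n_experts=8))  # ~7.6
```

In training, this term is typically added to the main loss with a small coefficient (Switch Transformer used 0.01) so it nudges routing toward balance without overwhelming the language-modeling objective.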

DeepSeek introduced a further refinement: shared experts that are always activated regardless of routing, alongside the standard routed experts. As described in their research, shared experts capture common knowledge that applies across all contexts, while routed experts handle domain-specific reasoning. This prevents duplication among routed experts and lets each one develop cleaner specializations.
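
In code, the shared-expert idea is a small addition on top of a routed layer like the one sketched earlier: shared experts run on every token, and their output is added to the routed result. The module below is an illustrative composition with placeholder shapes and counts, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Always-on shared experts layered on top of any sparsely routed MoE block."""
    def __init__(self, routed_block: nn.Module, d_model=512, d_hidden=2048, n_shared=1):
        super().__init__()
        self.routed = routed_block        # e.g. the MoELayer sketched earlier
        self.shared = nn.ModuleList([     # shared experts: run for every token
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_shared)
        ])

    def forward(self, x):
        out = self.routed(x)              # domain-specific, top-k routed experts
        for expert in self.shared:
            out = out + expert(x)         # common knowledge, applied unconditionally
        return out

# Usage with the earlier sketch: SharedPlusRoutedMoE(MoELayer())(torch.randn(4, 512))
```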

Why Open Source Dominates MoE

One of the more striking facts about the 2025–2026 MoE landscape is that virtually all the major advances are open-source. Mixtral, DeepSeek V3 and R1, OLMoE, Kimi K2 — all released under permissive licenses. According to December 2025 data, MoE now powers over 60% of open-source AI model releases.

This matters for the broader competitive picture. The efficiency gains that MoE provides are available to anyone, not just labs with frontier training budgets. A well-tuned MoE model with 50 billion total parameters and 10 billion active can compete with dense models that cost five times more to run, and that democratization is reshaping which organizations can deploy frontier-quality AI.

Connection to the Frontier

MoE is not an isolated technique — it works in concert with other architectural improvements covered on Frontierbeat. Sparse attention reduces the cost of the attention layers. MoE reduces the cost of the feedforward layers. Together, they address the two main computational bottlenecks in transformer-based models.

The result is a trajectory where the most capable models are getting cheaper to run even as they grow more capable. Understanding MoE is understanding why that trajectory is possible.


See also: What Is Sparse Attention | What Is NVIDIA Blackwell Ultra | DeepSeek V3.2-Exp
