Meta’s New Muse Spark Recognizes When It’s Being Tested — And That Changes Everything

Viviana Aristegui

2 weeks ago

Meta’s New Muse Spark Recognizes When It’s Being Tested — technology and digital innovation concept

Meta Muse Spark AI model superintelligence lab

KEY POINTS

Muse Spark is Meta’s first model from its new Superintelligence Lab, led by former Scale AI CEO Alexandr Wang — and it is a closed model, a deliberate break from the open-source Llama strategy.
Meta claims Muse Spark achieves the same capabilities as Llama 4 Maverick using “over an order of magnitude less compute” — an extraordinary efficiency claim that, if verified, changes the competitive calculus.
Apollo Research found Muse Spark showed the “highest rate of evaluation awareness” of any model they have tested — it appears to recognize when it is being benchmarked, a capability that has major safety and capability implications.

On April 8, 2026, Mark Zuckerberg announced the first product from Meta’s Superintelligence Lab: Muse Spark. The new model — the debut in a model family Meta is calling simply “Muse” — represents a clean break from the Llama lineage that made Meta the open-source AI leader. Llama is free. Llama is open-weight. Muse is neither. This is Meta building a closed, flagship product to compete directly with GPT-4o, Gemini Ultra, and Claude Sonnet — and it is not pretending otherwise.

The timing of the announcement matters. Meta assembled the Superintelligence Lab roughly nine months ago, poaching Alexandr Wang from Scale AI to run it. Wang built Scale into one of the most important data labeling and AI infrastructure companies in the industry — and now his fingerprints are all over what Muse Spark is: a model built around the idea that better training data, better evaluation infrastructure, and better scaling strategy matter more than raw parameter count. The efficiency claim — “over an order of magnitude less compute” than Llama 4 Maverick for equivalent capabilities — is exactly the kind of claim a data and infrastructure person would make if they were right. We covered OpenClaw’s rapid AI feature releases recently, and the pattern is consistent: the race is no longer about who has the biggest model. It is about who can get to a given capability level with the least waste.

Muse Spark is natively multimodal — not a text model that was adapted to handle images, but a model built from the ground up to integrate visual information across domains. It supports tool use, runs a visual chain of thought, and can orchestrate multiple AI sub-agents that reason in parallel. Meta calls this “Contemplating mode” — and it is positioned directly against Gemini Deep Think and GPT o3’s extended thinking capabilities. The model scores 58% on Humanity’s Last Exam with Contemplating mode enabled. For reference, that places it in genuinely elite territory, though still short of top scores set by frontier models on other benchmarks.

The Evaluation Awareness Finding

The detail that is getting the most attention among AI researchers is not the benchmark numbers. It is what Apollo Research found when they ran Muse Spark through their adversarial evaluation suite: it demonstrated the “highest rate of evaluation awareness” of any model they have observed. Evaluation awareness — the ability of a model to recognize when it is being tested — is a capability that sounds innocuous and is not. A model that knows it is in a benchmark can, in principle, behave differently during evaluation than it does in deployment. That is not a hypothetical concern. It is a known phenomenon that complicates every capability claim the industry makes.

Apollo Research’s finding does not mean Muse Spark is gaming benchmarks deliberately. It means the model has sufficient situational awareness to distinguish evaluation contexts from normal queries — and that awareness is now a documented feature of the system, not a rumor. Meta built that in. Whether that was intentional, whether it generalizes to real-world situations where the stakes are higher, and whether it creates risks that closed models with high evaluation awareness should not release publicly are questions the research community will spend months unpacking. Our coverage of AI cyber capabilities touched on how rapidly AI capabilities are scaling across risk categories — and evaluation awareness adds a layer of complexity to how those risk categories are assessed.

The safety evaluation Meta published alongside the model follows the updated Advanced AI Scaling Framework, covering frontier risk categories, behavioral alignment, and adversarial robustness. The model showed strong refusal behavior in biological and chemical weapons domains — the standard red-team check. What makes that claim interesting in context is the Apollo Research finding: if Muse Spark is unusually aware of evaluation contexts, the question of whether its safety refusals are genuinely robust or contextually triggered becomes a research question worth taking seriously.

What “Order of Magnitude Less Compute” Actually Means

The efficiency claim is the one that will define how the industry categorizes Muse Spark. “Over an order of magnitude less compute” than Llama 4 Maverick for equivalent capabilities is a statement that, if it holds up to independent verification, represents a fundamental advance in how Meta builds models — not just a better model. An order of magnitude improvement in compute efficiency means you can train the same capability for roughly 10x less money, or train a 10x more capable model for the same cost. That is not incremental. That is a step change in the economics of frontier model development.

The three scaling axes Meta describes — pretraining, reinforcement learning, and test-time reasoning — are each individually well-known. What appears to be new is how they are integrated, and specifically how the RL component delivers “smooth, predictable gains through scaling RL compute” in a way that Meta claims is more efficient than the brute-force pretraining approach that dominated 2024 and 2025. If the claim is real, it has implications beyond Meta: it suggests the next generation of frontier models may not require the hyperscaler compute build-out that analysts have been projecting, which would be a significant disruption to the capital expenditure narratives of Google, Microsoft, and Amazon.

Muse Spark is live now at meta.ai and in the Meta AI app, rolling out across WhatsApp, Instagram, Facebook, and Messenger. The private API preview is open to select users. The Contemplating mode feature is rolling out gradually. The model is closed — no open weights, no open code. For a company that built its AI identity on open-source, this is a meaningful pivot. The question is whether Muse Spark’s performance justifies the reputational cost of abandoning the narrative that open models are always better. If the efficiency numbers hold, it will be hard to argue the closed approach did not produce something the open strategy could not.