• Gemini 2.5 Pro leads with a 2M-token context window, while Anthropic made Claude’s 1M-token limit generally available in March 2026.
  • Larger context windows drive steeply higher compute costs—processing 1M tokens can cost roughly 250 times more than a 4K-token query.
  • Despite expanded limits, models tend to underweight tokens in the middle of long contexts, raising reliability concerns for document-heavy use cases.

When you paste a 200-page legal contract into ChatGPT and ask it to summarize the key risks, something invisible happens: the model reads everything, holds it all in memory, and uses all of it to generate your answer. That “memory” is called the context window, and it’s becoming one of the most contested battlegrounds in AI.

According to a December 2025 analysis, million-token context windows are now production-ready, with Gemini 2.5 Pro at 2M tokens and Claude Sonnet 4 at 1M tokens. In March 2026, Anthropic announced that the 1-million-token context window is generally available for Claude Opus 4.6 and Claude Sonnet 4.6.

This arms race isn’t academic. Every extra token of context costs compute, memory, and ultimately, money. According to Gartner’s September 2025 report, worldwide spending on AI is forecast to total nearly $1.5 trillion in 2025. Context-intensive applications are a major driver of this spending.

Understanding the context window is essential to understanding why AI infrastructure spending has reached historic levels—and why the race to expand it may be the defining investment thesis of the decade.

What the Context Window Actually Is

A context window is the amount of text a language model can consider at once when generating a response. When you send a prompt to an AI, everything in that prompt—your question, any instructions, uploaded documents, previous conversation history—counts toward the context window.

Models measure text in "tokens" rather than words. Tokens aren't quite words; they're subword units optimized for model efficiency. As a rule of thumb, 1,000 tokens equal about 750 words. A 200,000-token context window therefore holds roughly 150,000 words, the length of a medium-sized novel.
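That rule of thumb is easy to turn into arithmetic. The conversion factor below is the article's approximation, not the output of any real tokenizer:

```python
# Rough rule of thumb: 1,000 tokens ~ 750 words. These factors are
# approximations for sizing estimates, not exact tokenizer output.

TOKENS_PER_WORD = 1000 / 750  # ~1.33 tokens per English word

def words_to_tokens(words: int) -> int:
    """Estimate the token count for a given word count."""
    return round(words * TOKENS_PER_WORD)

def window_capacity_words(context_tokens: int) -> int:
    """Estimate how many words fit in a context window."""
    return round(context_tokens * 750 / 1000)

print(window_capacity_words(200_000))  # 150000 words, a medium-length novel
print(words_to_tokens(80_000))         # a typical novel manuscript in tokens
```

Real tokenizers vary by model and language, so treat these numbers as capacity planning estimates rather than exact counts.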

When you exceed the context window, something important happens: the model forgets. It can’t process more than its limit, so either you truncate earlier content, or the system simply ignores it. For users, this manifests as AI “forgetting” earlier parts of long conversations or failing to analyze documents that exceed the limit.
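A minimal sketch of that truncation behavior, assuming a simple drop-the-oldest policy and a word-count token estimate (production systems use real tokenizers and smarter eviction strategies):

```python
# Sketch of what a chat system might do when history exceeds the context
# limit: keep the system prompt, then drop the oldest turns first.
# Token counting here is a word-based approximation, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return round(len(text.split()) * 4 / 3)  # ~1.33 tokens per word

def truncate_history(system_prompt: str, messages: list[str],
                     limit_tokens: int) -> list[str]:
    """Keep the system prompt plus as many recent messages as fit."""
    budget = limit_tokens - estimate_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if cost > budget:
            break                   # everything older is "forgotten"
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))     # restore chronological order
```

This is why a long conversation seems to lose its earliest turns: once the budget is spent, older messages simply never reach the model.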

The practical implications of context window size are significant. A 50-page PDF fits comfortably in a 200K-token window; with a 4K-token window, you would have to hand-pick excerpts. Understanding a 10,000-line codebase exceeds most early LLM limits, and larger contexts let AI coding assistants see entire projects. If you're debugging code over a 50-message conversation, a small context window means the AI has forgotten the earliest errors by message 20. And analyzing hundreds of pages of financial documents, legal contracts, or academic papers demands substantial context.

As one analysis notes, Google just dropped Gemini 2.5 with a 2 million token context window—roughly 1,500 pages, an entire novel plus your company’s codebase.

The 2025-2026 Race in Numbers

Here’s how the major players currently stack up. According to Codingscape’s 2025 comparison, Gemini 3 models support a 1 million token input context window and up to 64K tokens of output, with Gemini 3.1 Pro released February 19, 2026. According to AIMultiple’s 2026 analysis, Gemini offers the largest readily available context window at 2 million tokens with native multimodal processing across text, audio, and images.

In March 2026, Anthropic announced that Claude Sonnet 4 now supports 1M tokens of context, with autonomous execution capable of running complex workflows for up to 30 hours without human input. The Hacker News discussion captures the significance: having one million tokens of context window is valuable for understanding large codebases, summarizing books, and all sorts of demanding tasks.

The Technical Challenge

Making context windows larger isn’t trivial. The core issue is that processing a larger context requires proportionally more compute—a model examining 1 million tokens needs to attend to relationships between all those tokens, a problem that grows quadratically with context size.

Every token in the context must be processed before generation begins. At the level of per-token pricing, a 1 million token context costs roughly 250x more to process than a 4,000 token context (the ratio of the token counts), even if the final answer is the same length; the attention computation itself grows faster still. More context means more memory requirements and more computation per token generated.
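The two growth rates at work here, linear token counts versus quadratic attention, can be shown with back-of-envelope arithmetic (illustrative ratios only):

```python
# Back-of-envelope cost ratios for long contexts. Per-token pricing
# scales linearly with context length, while full self-attention
# compute scales quadratically.

small, large = 4_000, 1_000_000

linear_ratio = large / small            # what per-token pricing tracks
quadratic_ratio = (large / small) ** 2  # what full attention compute tracks

print(f"linear (pricing):    {linear_ratio:,.0f}x")    # 250x
print(f"quadratic (attention): {quadratic_ratio:,.0f}x")  # 62,500x
```

The gap between those two numbers is why long-context research focuses so heavily on attention approximations and caching: charging customers linearly for a quadratically growing computation is not sustainable without them.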

As Google’s documentation explains, the capability comes with real complexity—acknowledging both what million-token windows make possible and what they demand from infrastructure.

The Economic Stakes

The context window race has serious financial implications. API pricing scales with context size—a million-token context might cost 10x more per query than a 128K-token context. Serving larger contexts requires more memory per GPU and more GPUs per request, driving demand for H100s, B200s, and custom silicon. Context window size is also a visible spec: a model with a 1M context window looks more capable than one with 32K, regardless of actual performance on any given task.
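To make the pricing point concrete, here is a toy cost calculator. The per-token prices are hypothetical, chosen only for illustration, and are not any vendor's actual rate card:

```python
# Hypothetical per-token prices (illustrative only, not a real vendor's
# rates) showing how input size dominates long-context bills.

PRICE_PER_M_INPUT = 3.00    # $ per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # $ per 1M output tokens (assumed)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one query at the assumed rates."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# Same 500-token answer, very different bills:
print(f"${query_cost(128_000, 500):.4f}")    # 128K-token context
print(f"${query_cost(1_000_000, 500):.4f}")  # 1M-token context
```

Even at these made-up rates, the million-token query costs several times more than the 128K one for an identical answer, which is the dynamic the paragraph above describes.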

According to J.P. Morgan’s 2025 analysis, the AI industry needs to make $650 billion in annual revenue to deliver a 10% return on investments. Meta announced in July 2025 that it would spend up to $72B on AI infrastructure that year. According to Steve Brown’s 2025 year-in-review, total capex spending by the big five—Meta, Microsoft, Google, Oracle, and Amazon—exceeded $400 billion for 2025, or more than $1 billion per day.

Larger contexts come with hidden problems. Research increasingly shows that models don’t use all context equally—tokens at the beginning and end of long contexts get more attention while middle tokens get “lost,” meaning the effective context may be smaller than the nominal context. Longer contexts also provide more opportunities for models to misremember or conflate information, making detailed questions about specific facts in a 500-page document potentially unreliable.

For developers, processing million-token contexts can produce unexpectedly high bills, and a million-token context might take 30+ seconds just to process before generation even begins.

What This Means for Users

For most users, current context windows are already more than sufficient. Writing emails, coding small projects, and answering questions require only a few thousand tokens. The context arms race targets specialized use cases: legal analysis, financial research, large codebase understanding.

But these use cases are high-value. According to MindStudio’s analysis, Anthropic expanded Claude Opus 4.6 and Sonnet 4.6 to 1 million tokens at no extra cost—with significant implications for agents, RAG, and long workflows. The Stanford HAI 2025 AI Index Report notes that AI business usage is rising globally, with context windows central to enterprise adoption.

The Infrastructure Investment Thesis

For investors, the context window race is an infrastructure story. Every increase in context size drives demand for more powerful GPUs with larger memory bandwidth, more server memory per node, more data center capacity, more power infrastructure, and more networking between servers.

Nvidia will spend $26 billion over five years on AI infrastructure. Google, Microsoft, Amazon, and Meta are each committing similar or larger amounts. Context-intensive applications justify this spending—if AI can process entire codebases, legal archives, or financial databases, enterprise customers will pay premium prices.

The context window is no longer just a technical parameter. It’s a proxy for capability, a driver of infrastructure investment, and increasingly, a competitive moat. Understanding it means understanding why AI companies are spending like the stakes are trillion-dollar opportunities.

Because they are.
