Grok AI Context Window, Token Limits, and Memory: architecture and usage rules

Grok AI, developed by xAI, has positioned itself as a long-context model suite with distinct token capacities, pricing rules, and a stateless memory design. The models Grok 4 and Grok 4 Fast support extended context windows that surpass most competitors, but they enforce specific thresholds for billing and request structure. Understanding these mechanics is essential for developers who want to deploy Grok efficiently across workloads that range from summarizing large contracts to powering conversational assistants.

Context windows supported by Grok models.

Grok’s two flagship models—Grok 4 (grok-4-0709) and Grok 4 Fast (reasoning)—offer different context sizes.

  • Grok 4 supports a 256,000-token context window. While this represents a very high capacity, requests that exceed 128,000 tokens are billed at a higher “extended context” rate.

  • Grok 4 Fast extends the limit to 2,000,000 tokens, making it one of the largest available context windows in production models. Like Grok 4, it applies tiered pricing when requests cross the 128,000-token threshold.

Older Grok models such as Grok 3 Mini operated at lower limits of around 128,000 tokens. As of 2025, the official model cards list Grok 4 and Grok 4 Fast as the active endpoints for long-context workloads.

Token pricing and throughput constraints.

Grok models apply distinct token economics depending on whether input tokens are fresh or cached. The pricing and throughput specifications are as follows:

  • Grok 4:

    • Input tokens: $3 per million.

    • Cached input tokens: $0.75 per million.

    • Output tokens: $15 per million.

    • Throughput: 2,000,000 tokens per minute, up to 480 requests per minute.

  • Grok 4 Fast:

    • Input tokens: $0.20 per million.

    • Cached input tokens: $0.05 per million.

    • Output tokens: $0.50 per million.

    • Throughput: 4,000,000 tokens per minute.

Caching discounts significantly reduce cost for prompts that reuse large prefixes, such as system instructions or schema templates. This design encourages developers to maintain stable prompt structures.
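
As a concrete illustration, the sketch below keeps the system prompt byte-stable across calls so the cached-input rate can apply. It assumes xAI's OpenAI-compatible chat completions endpoint and uses the grok-4-0709 identifier mentioned above; the prompt content and helper name are illustrative.

```python
# Sketch: keep a byte-identical prompt prefix across requests so the
# cached-input rate ($0.75/M for Grok 4 instead of $3/M) can apply.
# Assumes the OpenAI-compatible endpoint at https://api.x.ai/v1.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["XAI_API_KEY"], base_url="https://api.x.ai/v1")

# A stable prefix is what the cache can match; editing this string
# invalidates the discount for subsequent requests.
SYSTEM_PROMPT = (
    "You are a contract-analysis assistant. "
    "Answer strictly from the supplied document text."
)

def review_clause(document: str, question: str) -> str:
    response = client.chat.completions.create(
        model="grok-4-0709",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # reused, cacheable
            {"role": "user", "content": f"{document}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```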

Table — Grok AI token limits and costs.

Model        | Context window | Tokens/minute | Input price (per M) | Cached input (per M) | Output price (per M)
Grok 4       | 256,000        | 2,000,000     | $3.00               | $0.75                | $15.00
Grok 4 Fast  | 2,000,000      | 4,000,000     | $0.20               | $0.05                | $0.50

This table shows the wide gulf between the premium pricing of Grok 4 and the low per-token cost of Grok 4 Fast, reflecting their different roles in the ecosystem.
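
To make these economics concrete, the following sketch turns the table into a per-request cost estimate. It uses only the base prices listed; the extended-context rates that apply beyond 128,000 tokens are not published in the table, so the function simply flags when a request crosses that threshold.

```python
# Sketch: per-request cost from the base prices in the table (USD per
# million tokens). Keys are the table labels, not necessarily API model ids.
PRICES = {
    "Grok 4": (3.00, 0.75, 15.00),      # (input, cached input, output)
    "Grok 4 Fast": (0.20, 0.05, 0.50),
}

EXTENDED_THRESHOLD = 128_000  # above this, higher extended-context rates apply

def estimate_cost(model, input_tokens, cached_tokens, output_tokens):
    in_rate, cached_rate, out_rate = PRICES[model]
    cost = (input_tokens * in_rate
            + cached_tokens * cached_rate
            + output_tokens * out_rate) / 1_000_000
    crosses_tier = (input_tokens + cached_tokens + output_tokens) > EXTENDED_THRESHOLD
    return cost, crosses_tier

cost, extended = estimate_cost("Grok 4 Fast", 900_000, 100_000, 4_000)
print(f"base cost ${cost:.3f}, extended-context pricing applies: {extended}")
# -> base cost $0.187, extended-context pricing applies: True
```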

Live search and external grounding costs.

When Grok performs Live Search to ground responses in current information, additional billing applies. The system charges $25 per 1,000 sources used, which equates to $0.025 per source. Each API call reports the number of sources retrieved, allowing developers to monitor and control costs. Developers should adjust num_sources to match accuracy requirements while avoiding excessive search fees.
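
A small budgeting helper follows, built from the per-source rate above. The exact response field that reports sources used varies by SDK version and is not assumed here.

```python
# Sketch: budgeting Live Search fees at $25 per 1,000 sources.
COST_PER_SOURCE = 25.0 / 1_000  # $0.025 per source

def search_cost(sources_used: int) -> float:
    """Cost in USD for the sources consumed by one grounded request."""
    return sources_used * COST_PER_SOURCE

# A response that consulted 12 sources costs 12 * $0.025 = $0.30.
print(f"${search_cost(12):.2f}")
```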

Output token handling and request composition.

In Grok, the input tokens, output tokens, and images (converted into token equivalents) all count toward the total context window. Images typically consume between 256 and 1,792 tokens each, depending on size. While the model cards do not specify a hard maximum for output tokens alone, the combined request must remain within the declared context window.

Pricing differences above 128,000 tokens apply regardless of whether the content is input or output, so careful prompt engineering is required to balance performance with cost.
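
A pre-flight check along the following lines can catch oversized requests before they are sent. It is a sketch built from the window sizes and image-token range quoted above, budgeting each image at the upper bound since actual usage varies with size.

```python
# Sketch: pre-flight check that a planned request fits the context window.
# Window sizes are those listed above; images are budgeted at the upper
# bound of the 256-1,792 token range.
CONTEXT_WINDOWS = {"Grok 4": 256_000, "Grok 4 Fast": 2_000_000}
MAX_IMAGE_TOKENS = 1_792

def fits_window(model, input_tokens, max_output_tokens, num_images=0):
    total = input_tokens + max_output_tokens + num_images * MAX_IMAGE_TOKENS
    return total <= CONTEXT_WINDOWS[model], total

ok, total = fits_window("Grok 4", 200_000, 8_000, num_images=4)
print(ok, total)  # True 215168 -- comfortably inside the 256,000 window
```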

How Grok manages memory across sessions.

Unlike some consumer-facing assistants, Grok’s API is stateless. This means the model does not retain knowledge of prior messages unless developers explicitly resend them in the messages array. xAI’s documentation stresses that:

  • The API does not remember earlier requests by default.

  • Developers must implement their own conversation history storage.

  • The Responses API allows referencing a previous response.id to chain conversations, but this is still controlled by the developer and does not represent true long-term memory.

Consequently, any memory functionality in applications built with Grok must be implemented externally, often through database storage, embeddings, or summary injections.
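
A minimal sketch of that pattern, again assuming the OpenAI-compatible endpoint: the client owns the history and resends it in full on every call.

```python
# Sketch: client-side conversation memory for a stateless API. An in-memory
# list stands in for a database; the full history is resent on every call,
# and long histories would need trimming or summarizing to fit the window.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["XAI_API_KEY"], base_url="https://api.x.ai/v1")
history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="grok-4-0709",
        messages=history,  # the API sees only what is resent here
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```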

Practical recommendations for using Grok.

  • Choose the right model: Grok 4 is designed for premium quality with 256,000 tokens, while Grok 4 Fast offers cost-effective processing for extremely large inputs up to 2 million tokens.

  • Control token usage: Keep prompts modular and use cached inputs wherever possible to take advantage of lower pricing tiers.

  • Design external memory: Applications that require continuity should maintain their own memory store, retrieving relevant context per request.

  • Budget for live search: Since each source incurs cost, enable search only when live grounding is essential.

  • Monitor throughput: Align workloads with the tokens-per-minute limits to avoid throttling in high-volume scenarios; a simple client-side limiter is sketched after this list.
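
The sketch below implements the client-side limiter referenced in the last item, using a 60-second sliding window sized to Grok 4's published tokens-per-minute ceiling. Server-side enforcement may differ, so treat it as a conservative guard rather than the official mechanism.

```python
# Sketch: client-side sliding-window limiter sized to Grok 4's 2,000,000
# tokens-per-minute ceiling.
import time
from collections import deque

class TokenRateLimiter:
    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.events = deque()  # (timestamp, tokens) pairs within the last 60 s

    def acquire(self, tokens: int) -> None:
        """Block until `tokens` can be spent without exceeding the limit."""
        while True:
            now = time.monotonic()
            while self.events and now - self.events[0][0] > 60:
                self.events.popleft()  # drop spend outside the window
            if sum(t for _, t in self.events) + tokens <= self.limit:
                self.events.append((now, tokens))
                return
            time.sleep(0.5)  # wait for capacity to free up

limiter = TokenRateLimiter(2_000_000)
limiter.acquire(50_000)  # call with each request's estimated token count
```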

Operational impact for enterprises and developers.

For enterprises, Grok’s very large context windows make it suitable for contract analysis, compliance checks, or long document summarization at scale. For developers, Grok 4 Fast is a powerful choice for RAG pipelines, enabling ingestion of millions of tokens in a single context while keeping costs lower than traditional premium LLMs. Across both models, the lack of built-in memory reinforces the importance of designing data persistence and recall layers externally.
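
As a rough illustration of that ingestion pattern, the following sketch packs ranked chunks into a single Grok 4 Fast context while reserving headroom for the answer; the token counter is a stated placeholder, not a real tokenizer.

```python
# Sketch: pack ranked retrieval chunks into Grok 4 Fast's 2,000,000-token
# window, reserving headroom for the model's answer. count_tokens is a
# rough placeholder (~4 characters per token); swap in a real tokenizer.
CONTEXT_WINDOW = 2_000_000
ANSWER_RESERVE = 16_000

def count_tokens(text: str) -> int:
    return len(text) // 4  # heuristic only

def pack_context(chunks: list[str], prompt_tokens: int) -> str:
    budget = CONTEXT_WINDOW - ANSWER_RESERVE - prompt_tokens
    packed, used = [], 0
    for chunk in chunks:  # chunks assumed pre-sorted by relevance
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop before overflowing the window
        packed.append(chunk)
        used += cost
    return "\n\n".join(packed)
```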

By combining extreme context capacity, distinct pricing rules, and a stateless architecture, Grok AI offers developers both flexibility and responsibility in how they design prompts, store conversation history, and balance costs.
