DeepSeek context window: token limits, memory policy, and 2025 rules
- Graziano Stefanelli
- Aug 12

DeepSeek applies clear token limits that depend on the model used.
DeepSeek’s models enforce a fixed context window, measured in tokens, that defines how much content the AI can handle per request. This includes your full input prompt, any previous chat history (if manually provided), system instructions, few-shot examples, and—in some cases—the model's output. The window size and behavior differ between DeepSeek's two main production models: DeepSeek Chat and DeepSeek Reasoner. Each comes with unique constraints and optimizations for reasoning and output generation.
DeepSeek Chat uses a 64,000-token context limit for full turns.
The default model behind the deepseek-chat endpoint (typically DeepSeek-V3-0324) accepts a 64K-token input+output window. This includes all input text (user prompt, system messages, few-shot samples) plus the model's reply. By default, the reply length is limited to 4,000 tokens, but it can be extended up to 8,000 tokens. Once the total exceeds the 64K cap, the model truncates or fails to respond. Developers must manually manage long multi-turn conversations, as the API is stateless: all previous context must be explicitly re-sent with every call.
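As a concrete sketch, here is what a deepseek-chat request might look like through the OpenAI-compatible Python SDK that DeepSeek's API supports; the API key is a placeholder and the max_tokens value is an illustrative choice:

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API at this base URL.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",  # placeholder
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the plot of Hamlet in three sentences."},
    ],
    max_tokens=8000,  # raise the default 4K reply cap toward the 8K maximum
)

print(response.choices[0].message.content)
```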
DeepSeek Reasoner decouples input and output windows.
The deepseek-reasoner model (based on DeepSeek-R1-0528) supports a 64,000-token input limit, but allows significantly longer outputs. Unlike most chatbots, the output doesn’t count against the input budget.
You can receive up to 32K or even 64K tokens in output (including reasoning chains), depending on the max_tokens parameter. This makes it well-suited for step-by-step logic and chain-of-thought tasks, but requires you to manage your input size carefully.
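A minimal sketch of a Reasoner call, reusing the client from the example above; the max_tokens value here is an arbitrary illustration of the larger output headroom:

```python
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Is 9.11 greater than 9.8? Explain."}],
    max_tokens=32000,  # room for a long chain-of-thought plus the final answer
)

message = response.choices[0].message
print(message.reasoning_content)  # the chain-of-thought trace (output-only)
print(message.content)            # the final answer
```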
Input always includes system prompts and reused content.
In both DeepSeek Chat and Reasoner, tokens are consumed by:
Your full prompt text
Any system-level or role-based instructions
Tool calls, formatting, or JSON scaffolds
Previously seen messages (if manually replayed)
Each of these contributes to the 64K input ceiling. Even small additions, such as role labels or JSON scaffolding, consume tokens invisibly unless tracked. There is no automatic memory between turns, so everything from past turns must be re-sent, as the sketch below illustrates.
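To make the accounting concrete, here is a sketch of a fully assembled payload; every string in it is tokenized and billed against the input window (the content itself is invented for illustration):

```python
messages = [
    # System-level instructions: count toward input tokens
    {"role": "system", "content": "You are a legal-drafting assistant."},
    # Few-shot example pair: also counts, on every request
    {"role": "user", "content": "Rewrite: 'the party of the first part'"},
    {"role": "assistant", "content": "'the first party'"},
    # Replayed history from earlier turns: counts again each call,
    # along with the small per-message overhead of role labels
    {"role": "user", "content": "Now rewrite clause 4 of the attached draft."},
]
```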
Output from Reasoner includes reasoning content but must not loop back.
When Reasoner returns a result, it separates the reasoning trace (chain-of-thought) from the final reply. The field reasoning_content is purely output-only: developers must not include it in the next prompt, because doing so breaks context management. Instead, the model handles long reasoning flows in one pass, using its large output headroom, while developers provide only the necessary inputs each time.
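A sketch of the safe multi-turn pattern, assuming the same client as above: append only the final answer to the running history, never the reasoning trace.

```python
messages = [{"role": "user", "content": "What is 1 + 1?"}]
response = client.chat.completions.create(model="deepseek-reasoner", messages=messages)

answer = response.choices[0].message.content
# Append ONLY the final answer; reasoning_content must never be sent back.
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "Now square that result."})

response = client.chat.completions.create(model="deepseek-reasoner", messages=messages)
```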
Token costs depend on language and character count.
DeepSeek provides a conversion estimate:
1 English character ≈ 0.3 tokens
1 Chinese character ≈ 0.6 tokens
This means that a 1,000-token input is roughly equivalent to 3,300 characters of English (about 600–900 words). The actual count may vary based on punctuation, casing, and formatting. For precise usage, developers should inspect the usage field in the API response, which details token breakdowns.
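For quick budgeting before a call, a rough estimator based on those published ratios (the helper is illustrative, not an official tokenizer); the authoritative numbers always come from the usage field of a response:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate from DeepSeek's published ratios:
    ~0.3 tokens per English character, ~0.6 per Chinese character."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return round(other * 0.3 + cjk * 0.6)

print(estimate_tokens("Hello, DeepSeek!"))  # roughly 5 tokens

# The exact count comes back with every API response:
print(response.usage.prompt_tokens, response.usage.completion_tokens)
```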
DeepSeek is stateless—multi-turn threads require explicit input replay.
Unlike assistants with persistent memory, DeepSeek requires you to include the conversation history in each request if you want the model to “remember” context. This adds up quickly in multi-turn chats. One workaround is using summaries or anchored references instead of full thread replay, especially as you approach the 64K input limit.
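One possible shape of that workaround, reusing the client and the estimate_tokens helper sketched above (the 50K budget and the summary prompt are arbitrary choices):

```python
TOKEN_BUDGET = 50_000  # leave headroom under the 64K input ceiling

def compact_history(messages: list[dict]) -> list[dict]:
    """If the estimated size nears the budget, replace older turns
    with a one-message summary and keep only the recent exchange."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total < TOKEN_BUDGET:
        return messages

    old, recent = messages[:-4], messages[-4:]
    summary = client.chat.completions.create(
        model="deepseek-chat",
        messages=old + [{"role": "user", "content": "Summarize this conversation in 200 words."}],
        max_tokens=400,
    ).choices[0].message.content

    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent
```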
Context caching reduces token costs on repeated prompts.
When using the API, if two requests share a common prefix (e.g., system prompt or few-shot examples), DeepSeek applies context caching. This means those repeated tokens are not billed again.
You’ll see this reflected in the usage object, under fields like prompt_cache_hit_tokens and prompt_cache_miss_tokens. Tokens are cached in 64-token chunks, allowing you to reuse common setup blocks efficiently across requests.
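Checking whether caching applied is just a matter of reading those fields off a response; getattr guards against SDK versions that don't type these DeepSeek-specific fields:

```python
usage = response.usage
hit = getattr(usage, "prompt_cache_hit_tokens", 0)
miss = getattr(usage, "prompt_cache_miss_tokens", 0)
print(f"cached (discounted): {hit} tokens, freshly billed: {miss} tokens")
```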
Uploading files in the app consumes context but has unclear limits.
In the consumer DeepSeek app, you can upload documents, images, and PDFs. The system extracts text from files and uses it for answering. However, official documentation does not specify exact file size, number, or token parsing limits.
Online guides suggest informal ceilings (e.g., 50MB per file, 20–50 files at once), but these are not confirmed. As a rule of thumb, assume that the extracted text counts toward the same 64K window as typed input.
DeepSeek doesn’t impose hard rate limits, but throttling occurs.
There's no published fixed rate quota (tokens per minute or requests per second) on the DeepSeek API. However, during periods of high traffic the system applies dynamic throttling, and users may receive 429 errors or slow responses. This is particularly common during spikes in multi-turn workloads, or with long prompts that push close to the context window.
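Since the throttling is dynamic rather than quota-based, a client-side exponential backoff is the usual defense; a sketch with arbitrary retry parameters:

```python
import time
from openai import RateLimitError  # raised by the SDK on HTTP 429

def call_with_backoff(messages, retries=5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="deepseek-chat", messages=messages
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between attempts
    raise RuntimeError("still throttled after retries")
```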
Privacy and retention: DeepSeek collects and stores chat data.
According to its privacy policy, DeepSeek may store:
User inputs
Uploaded files
Device and network data
Chat history and usage logs
This data may be used to improve and train its models. Storage occurs on cloud servers based in mainland China (PRC). Users can delete chat history and exercise data rights via platform settings.
When context breaks, the model shows typical overflow symptoms.
If you exceed the 64,000-token input limit (or 4K/8K output cap in Chat), you may encounter:
Cut-off replies
Loss of earlier messages
Looping or repeated outputs
Prompt rejection
In Reasoner, failing to allocate sufficient max_tokens may cause incomplete reasoning chains or a missing final answer. Monitoring token use is critical; a simple check is shown below.
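A cheap guard against silent truncation is to check finish_reason, which OpenAI-compatible APIs set to "length" when the output cap is hit:

```python
choice = response.choices[0]
if choice.finish_reason == "length":
    # The reply was cut off at max_tokens; raise it or shorten the input.
    print("Warning: output truncated at the token cap.")
```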
Best practices to manage DeepSeek’s token window.
Summarize previous messages rather than replaying full history.
Use Reasoner for long outputs or multi-step logic.
Take advantage of context caching to reduce token cost.
Avoid inserting CoT outputs back into the prompt in Reasoner.
Start new threads once you're near the token ceiling.
Track usage actively to optimize request size.
DeepSeek’s limits are generous but demand precision.
With 64K tokens for input and up to 64K for output (in Reasoner), DeepSeek offers a wide operational window. But because it is stateless and sensitive to replay size, token planning is essential. Whether building workflows or just chatting, every character counts—and every prompt must fit within the model’s invisible envelope.

