
ChatGPT context window: token limits, memory policy, and 2025 rules


Context and tokens are concrete units you can budget and measure.

A token is the model’s unit of text accounting. For typical English, a rough and serviceable estimate is 1 token ≈ ~4 characters ≈ ~¾ of a word; that gives 100 tokens ≈ ~75 words.


This lets you size prompts and outputs without surprises. In addition to visible text, advanced models may spend reasoning tokens internally; they’re counted for billing and still consume your budget.

A context window is the total token budget a model can hold at once—inputs + outputs + hidden/system/tool messages. If a request would exceed that budget, the system must truncate something (usually older turns) or reduce the output allowance.
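
To make that arithmetic concrete, here is a minimal sketch in Python; estimate_tokens is a hypothetical helper that simply applies the ~4-characters-per-token heuristic above, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: about 4 characters per token."""
    # Heuristic only; use a real tokenizer for exact counts (see the tokenizer note further down).
    return max(1, round(len(text) / 4))

draft = "Summarize the attached meeting notes in five bullet points."
print(estimate_tokens(draft))   # about 15 tokens for a ~60-character prompt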



The app and the API follow the same arithmetic, but model choice sets the ceiling.

In the ChatGPT app, the effective window depends on the model you select (e.g., GPT-5 vs GPT-4o). In the API, you name the model explicitly and should size both prompt tokens and max output tokens so their sum stays below the window—remember that tool calls and system messages count too. A safe habit is to leave a headroom buffer for the completion and any tools the model might invoke.
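
As a rough sketch of that habit, the helper below picks an output allowance that preserves a headroom buffer; safe_max_output and the sample numbers are illustrative, not part of any official SDK.

def safe_max_output(context_window: int, prompt_tokens: int,
                    overhead_tokens: int = 0, headroom: int = 2_000) -> int:
    """Largest output allowance that keeps prompt + output + tool/system overhead under the window."""
    remaining = context_window - prompt_tokens - overhead_tokens - headroom
    if remaining <= 0:
        raise ValueError("Prompt and overhead already fill the window; trim the input first.")
    return remaining

# Example: a 128k window, a 90k-token prompt, and ~3k tokens of tool and system messages.
print(safe_max_output(128_000, 90_000, 3_000))   # 33000 tokens left for the completion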



The 2025 limits that matter in practice.

GPT-5 (chat-latest in API): 400k token context window. Use it for very long inputs or when you expect extended step-by-step reasoning; you still need to budget output.

GPT-4o: 128k token context window with 16,384 max output tokens—strong default for most day-to-day work.

o-series (o3-mini as reference): 200k token window with 100k max output tokens; useful when you need larger windows and long answers in one pass.

These are nominal ceilings. Real throughput depends on what else is in the turn (tools, images) and on latency policies; plan with margin.
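
Expressed as data, the nominal figures above might sit in a small lookup table like this sketch; the model labels and the fits helper are planning aids only, not an official API.

# Nominal ceilings listed above: (context window, max output tokens); None = no separate cap given.
LIMITS = {
    "gpt-5":   (400_000, None),
    "gpt-4o":  (128_000, 16_384),
    "o3-mini": (200_000, 100_000),
}

def fits(model: str, prompt_tokens: int, max_output_tokens: int) -> bool:
    window, output_cap = LIMITS[model]
    if output_cap is not None and max_output_tokens > output_cap:
        return False
    return prompt_tokens + max_output_tokens <= window

print(fits("gpt-4o", prompt_tokens=110_000, max_output_tokens=16_384))   # True, but with little margin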



File uploads are not the same as in-turn context.

In ChatGPT and GPTs, each file can be up to 512 MB; text/document files are capped at ~2 million tokens per file for indexing. That cap governs ingestion, not what fits into a single turn. In other words, the system may index the whole file, but only a slice is pulled into the active context for a given reply.

On ChatGPT Enterprise, the platform can include up to ~110k tokens from uploaded documents into a single answer’s context. Treat this as a working upper bound, not a guaranteed value for every request.
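
Here is a small sketch of how you might plan around these two different caps; the constants mirror the figures above, and the ingestion_plan helper is purely illustrative.

IN_TURN_LIMIT = 110_000         # working upper bound for tokens stuffed into one answer's context
PER_FILE_INDEX_CAP = 2_000_000  # approximate indexing cap per text/document file

def ingestion_plan(document_tokens: int) -> str:
    if document_tokens <= IN_TURN_LIMIT:
        return "May fit into a single answer's context; still leave room for the reply itself."
    if document_tokens <= PER_FILE_INDEX_CAP:
        return "Indexable as one file, but each reply sees only a slice; ask section by section."
    return "Split the file before uploading; it exceeds the per-file indexing cap."

print(ingestion_plan(350_000))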



Losing context is a predictable outcome of exceeding the budget.

When the running total approaches the window, the system starts trimming older turns or silently compressing context. Symptoms include answers that ignore earlier constraints, fail to reference prior IDs, or restate definitions. If your max output is set high while the prompt is already large, the system may reject the call or deliver a shortened response because there is not enough room left for generation.
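
From the caller’s side, the same trimming can be made explicit rather than silent. This is a minimal sketch that drops the oldest turns first, using the rough 4-characters-per-token estimate; the message format is only an assumption.

def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the newest turns whose estimated size fits the budget; oldest turns fall off first."""
    kept, used = [], 0
    for message in reversed(messages):               # walk from newest to oldest
        cost = len(message["content"]) // 4 + 1      # rough ~4-characters-per-token estimate
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))                      # restore chronological order

history = [
    {"role": "user", "content": "Constraint [REQ-142]: ship dates must be ISO-8601."},
    {"role": "assistant", "content": "Noted, all dates will use ISO-8601."},
    {"role": "user", "content": "Now draft the release email."},
]
print(trim_history(history, budget_tokens=20))       # the oldest constraint is the first to go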


Memory features control what persists across sessions, not the per-turn window.

Saved memories can store stable preferences and facts you ask ChatGPT to remember, and you can view, edit, or turn them off at any time; Temporary Chat avoids using them. Chat history may also be referenced for relevance. These features affect what the assistant brings in, not how many tokens it can hold per turn—context window math still rules the turn.



Practical token equivalents help you budget prompts and outputs.

For quick planning, treat 1,000 tokens ≈ ~750 English words. A four-page, single-spaced brief (~2,500–3,000 words) is ~3.5k–4k tokens—comfortable for any modern model, but the answer and tool chatter must still fit. For precise work (legal, code reviews, batch jobs), run your text through a tokenizer before sending.
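
When you do need exact numbers, a tokenizer call is short. This sketch assumes the tiktoken package is installed, that the o200k_base encoding matches your target model, and that four_page_brief.txt is a placeholder path for your own document.

import tiktoken

encoding = tiktoken.get_encoding("o200k_base")         # pick the encoding that matches your model
with open("four_page_brief.txt", encoding="utf-8") as f:
    brief = f.read()
print(len(encoding.encode(brief)))                     # exact token count to weigh against your window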


Workflows that keep context intact for long projects.

Chunk and stage: split large inputs into logical sections and process them step-wise, carrying forward running summaries instead of raw text (see the sketch after this list).

Inline anchors: tag entities and decisions with stable IDs (“[REQ-142]”, “Customer#A17”) and refer to those anchors, not to long verbatim passages.

Tight prompts: keep system and tool definitions minimal; verbosity in tools is paid in tokens.

Right-size max output: if the prompt is big, lower max output tokens so the sum stays within the window; request append-only outputs (e.g., “bullet diff only”) to reduce length.
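
A minimal sketch of the chunk-and-stage pattern; summarize is a hypothetical stand-in for a call to whichever model you use, and the section strings are placeholders.

def summarize(running_summary: str, chunk: str) -> str:
    """Placeholder for a model call that folds one chunk into the running summary."""
    # In practice this would send running_summary plus chunk to the model and return its reply.
    return running_summary + f"\n- key points from a {len(chunk)}-character section"

def chunk_and_stage(sections: list[str]) -> str:
    summary = "Running summary:"
    for section in sections:                      # one logical section per step
        summary = summarize(summary, section)     # carry the summary forward, not the raw text
    return summary

print(chunk_and_stage(["Scope and goals ...", "Requirements [REQ-142] ...", "Open risks ..."]))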



Detecting truncation and recovering when it happens.

Ask for a self-check: “List the key constraints we’ve agreed on so far; if any are missing, say ‘MISSING’.” If constraints disappear, you’ve likely crossed the window.

Use recap prompts: paste the last running summary plus current instructions, then continue in a new thread to reset accumulated history (a sketch of this follows the list).

Prefer reference mode over repetition: “Use section §3 from Summary-v7” instead of pasting §3 again.
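
A small sketch of the recap pattern, reusing the self-check question from above; the summary text and task wording are placeholders.

SELF_CHECK = "List the key constraints we've agreed on so far; if any are missing, say 'MISSING'."

def recap_prompt(running_summary: str, current_task: str) -> str:
    """Open a fresh thread from the last summary instead of dragging the full history along."""
    return f"{running_summary}\n\nCurrent task: {current_task}\n\n{SELF_CHECK}"

print(recap_prompt("Summary-v7: dates are ISO-8601; audience is enterprise admins.",
                   "Draft the release email."))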


Notes for teams and enterprise use.

Turn on data controls aligned with policy (e.g., exclude content from training; default for Team/Enterprise). Keep a local log of decisions and IDs so you can rebuild context deterministically in new threads. Standardize prompt templates with explicit sections—Context, Task, Constraints, Output spec—and a summary block that you roll forward each iteration. For file workflows, plan around the in-turn stuffing limit (e.g., ~110k tokens) rather than raw document size; it’s the best predictor of whether a single response will “see” enough of your source material.
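
A template along those lines might look like this sketch; the section names come from the paragraph above, and the filled-in values are placeholders.

PROMPT_TEMPLATE = """\
Context: {context}
Task: {task}
Constraints: {constraints}
Output spec: {output_spec}

Rolling summary (carry this block forward each iteration):
{summary}
"""

print(PROMPT_TEMPLATE.format(
    context="Release notes for version 2.4",
    task="Draft the customer-facing changelog",
    constraints="Reference IDs like [REQ-142]; dates in ISO-8601",
    output_spec="Bulleted list, max 200 words",
    summary="Summary-v7: audience is enterprise admins; tone is neutral.",
))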



Policy recap you can rely on when planning a session.

Know your model window (e.g., GPT-5 400k, GPT-4o 128k, o-series up to 200k), include all contributors (input + output + tools + system), and leave headroom. Treat file limits and ingestion caps as preprocessing constraints, not promises of what fits into a single reply. And remember: memory helps with persistence and personalization, but it doesn’t increase the per-turn context window.


