ChatGPT: token limits and context windows updated for all models in 2025
- Graziano Stefanelli

The capacity of ChatGPT to process, retain, and reference information depends on the size of its context window and the token limits set by each model version.
These technical ceilings determine how much history, uploaded content, or document text the system can “remember” in a single chat. For users drafting long reports, analysing complex files, or building workflows around ChatGPT, knowing these updated limits is essential to avoid lost context or incomplete answers.
Each ChatGPT model tier has a distinct context window.
Every message, prompt, or uploaded file consumes tokens. When the session reaches the token ceiling, ChatGPT will automatically begin to forget the oldest turns—potentially losing instructions, prior data, or the thread of a multi-step analysis. The available memory varies by model and mode, and it’s not always the same as the limit advertised for the API version.
The difference in context window sizes is not simply academic: it shapes what you can do in a single chat. For a short, interactive conversation, almost any model will suffice. However, when you are working on extended projects—like multi-section reports, legal reviews, or scientific papers—the window size directly determines whether ChatGPT can recall your earlier instructions or references from previous uploads. Users relying on persistent context, such as technical analysts or enterprise researchers, often encounter the limits of the smaller models sooner than expected.
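To make the trimming behaviour concrete, here is a minimal sketch of oldest-first truncation using the tiktoken tokenizer. ChatGPT's actual context management is not public, so the 128,000 budget (mirroring the GPT-5 Fast figure below) and the drop-oldest strategy are illustrative assumptions, not the confirmed mechanism.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a widely used OpenAI tokenizer

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def trim_history(turns: list[str], budget: int = 128_000) -> list[str]:
    """Drop the oldest turns until the conversation fits the budget.

    The real service may trim or summarise differently (assumption);
    this just illustrates why early instructions disappear first.
    """
    total = sum(count_tokens(t) for t in turns)
    while turns and total > budget:
        total -= count_tokens(turns.pop(0))  # the oldest turn is forgotten first
    return turns
```

Running your own count like this is the simplest way to see how close a long thread is getting to the ceiling.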
| Model & Mode in ChatGPT | Context window (tokens) | Typical usage scenario |
| --- | --- | --- |
| GPT-5 Fast | 128,000 | Day-to-day drafting, summaries, emails |
| GPT-5 Thinking | 196,000 | Deep reasoning, long-form analysis |
| GPT-4.1 (chat) | 32,000 | Standard Q&A, mid-length content |
| GPT-4o (legacy) | 32,000 | Legacy support, fast general tasks |
| o3 / o4-mini | 200,000 | Extended memory, research, book-length input |
In the API, GPT-5 can offer up to 400,000 tokens of context, and GPT-4.1 (via API) is now available with up to 1,000,000 tokens—far exceeding what’s available in the main ChatGPT interface. These high ceilings support use cases like processing entire books, large codebases, or vast chat logs, but only for developers integrating directly with the API. For most web and app users, staying within the ChatGPT interface means working within the more restrictive—but still impressive—context budgets above.
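For developers, reaching those larger windows means calling the API directly. The sketch below uses the official openai Python SDK; the model name and file path are assumptions drawn from the figures above, not a confirmed configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical long input that would overflow the in-app window.
with open("whole_book.txt") as f:
    book = f.read()

response = client.chat.completions.create(
    model="gpt-4.1",  # the tier the article credits with a 1,000,000-token window
    messages=[
        {"role": "system", "content": "You are a careful summariser."},
        {"role": "user", "content": f"Summarise the key arguments of this book:\n\n{book}"},
    ],
)
print(response.choices[0].message.content)
```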
System overhead and reply limits affect usable memory.
The headline context window is not entirely available for user prompts and replies. ChatGPT always reserves a portion—typically 750 to 900 tokens—for system instructions, routing, and safety logic. This means a “128,000” model provides about 127,000 tokens for real user and file content.
This system overhead is mostly invisible to the user, but it’s critical when working near the model’s maximum capacity. If you attempt to paste the full text of a 128,000-token book into GPT-5 Fast, some of your content may be cut simply to make room for the background safety and routing layers. For tasks where absolute completeness and context retention are crucial, it’s always best to allow an extra 1,000-token buffer.
In addition, there is a cap on any single model reply. Even with a large context window, GPT-5 Fast and GPT-5 Thinking will not generate more than 8,000 tokens in one answer inside the ChatGPT app. This affects users attempting to request entire chapter drafts, lengthy codebases, or long-form data exports in a single output. The API allows higher output caps (up to 128,000 tokens per reply), but not through the standard chat interface. Segmenting your requests into logical chunks can help ensure that outputs remain within the practical reply limits.
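One practical pattern is to request long outputs section by section rather than in one reply. A minimal sketch, assuming the openai SDK, an arbitrary model name, and a hypothetical outline; the 8,000-token figure simply mirrors the in-app cap described above.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """One call per section keeps each reply comfortably under the cap."""
    r = client.chat.completions.create(
        model="gpt-4o",    # any available model; the name here is an assumption
        messages=[{"role": "user", "content": prompt}],
        max_tokens=8_000,  # mirror the in-app reply ceiling
    )
    return r.choices[0].message.content

outline = ["Introduction", "Methods", "Results", "Discussion"]  # hypothetical
sections = [ask(f"Write only the '{s}' section of the report.") for s in outline]
full_report = "\n\n".join(sections)
```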
File uploads and document processing count toward the same window.
Uploading PDFs, PowerPoints, or spreadsheets into ChatGPT will quickly use up available memory. The platform compresses and strips out non-text elements from files to reduce token load:
- A 20 MB PDF may be compressed to 30,000–40,000 tokens if it's mostly text.
- Decorative images, background graphics, and repeated headers are removed.
- Large files may require splitting or trimming to fit the active context window, especially in enterprise plans.
This file handling logic means that even large files can sometimes “fit” within your context window, though the content that is visible to ChatGPT may not be identical to the original file. When working with complex PDFs—such as scientific papers with figures and tables—users should expect tables to be preserved, but graphics and annotations might not appear in the assistant’s memory. For teams routinely uploading long reports or presentations, it’s a good practice to pre-process files, remove unnecessary slides or sections, and focus on the most relevant content.
Uploaded content shares the context window with chat history, so long sessions with multiple attachments can cause early truncation of previous messages. To maximise utility, users may need to periodically summarise earlier conversation points, then clear the thread and start fresh with only the essential context.
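Before uploading, it can help to estimate how many tokens a document's text layer will actually consume. Below is a rough sketch using pypdf and tiktoken; the file name and the threshold are placeholders, and OpenAI's server-side extraction and compression will not match this count exactly.

```python
import tiktoken
from pypdf import PdfReader

enc = tiktoken.get_encoding("cl100k_base")

reader = PdfReader("quarterly_report.pdf")  # hypothetical file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

tokens = len(enc.encode(text))
print(f"~{tokens:,} tokens of extractable text")
if tokens > 100_000:  # arbitrary threshold below the larger chat windows
    print("Consider splitting the file or trimming low-value sections first.")
```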
Temporary model fallback and auto-downgrade affect context in high-traffic periods.
OpenAI manages usage spikes by temporarily downgrading active chats to earlier models. For example, free-tier sessions on GPT-5 Fast may switch to GPT-4.1 if GPU pools are overloaded, reducing the context window from 128,000 down to 32,000 tokens. A banner will notify users when this occurs.
Enterprise customers are insulated from this fallback and maintain contracted context windows even during peak hours.
This auto-downgrade mechanism is invisible in everyday use unless you’re monitoring for sudden changes in memory or output quality. For writers and analysts working with large files or expecting persistent memory, it’s important to pay attention to on-screen notifications or warning banners, especially when the platform is experiencing heavy demand. While the shift is temporary, it can result in sudden truncation or the loss of earlier uploads, so keeping offline backups of key prompts and files is highly recommended for professional users.
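Keeping offline copies of key prompts is easy to automate. A minimal sketch that snapshots a conversation to local JSON; the file name and turn format are arbitrary choices, not anything ChatGPT itself provides.

```python
import datetime
import json

def backup_thread(turns: list[dict], path: str = "chat_backup.json") -> None:
    """Save a local snapshot so a fallback or truncation never loses work."""
    snapshot = {
        "saved_at": datetime.datetime.now().isoformat(),
        "turns": turns,  # e.g. [{"role": "user", "content": "..."}]
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, indent=2, ensure_ascii=False)

backup_thread([{"role": "user", "content": "Draft section 2 of the report."}])
```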
Updated context window comparison table (2025).
| Model / Platform | ChatGPT window | API window | Reply cap in chat | Notes |
| --- | --- | --- | --- | --- |
| GPT-5 Fast | 128,000 | 400,000 | 8,000 | Default in Plus/Team; falls back to 4.1 under load |
| GPT-5 Thinking | 196,000 | 400,000 | 8,000 | For deep reasoning and code |
| o3 / o4-mini | 200,000 | 200,000 | 8,000 | Longest window in consumer chat |
| GPT-4.1 | 32,000 | 1,000,000 | 8,000 | API enables a much larger window |
| GPT-4o (legacy) | 32,000 | 128,000 | 8,000 | Still selectable for some users |
This table offers a direct visual comparison for users deciding which tier or integration to use for different types of projects. Developers leveraging the API gain access to much larger context windows, but everyday users on the ChatGPT platform must plan within the practical ceilings noted above.
Practical guidance for users and teams.
- Always allow for system overhead: subtract ~1,000 tokens from the published window for real work. For critical workflows, budgeting for overhead prevents accidental truncation or message loss when the context window fills up. (The sketch after this list turns these numbers into a quick pre-flight check.)
- Watch the reply limit: even large windows will not generate more than 8,000 tokens per answer in the app. Users drafting lengthy reports or multi-section outputs should split requests and review responses for completeness.
- Compress large uploads: pre-process or split long files for better handling, and check which parts are "visible" to the model after upload. Reducing unnecessary content before uploading saves memory and improves retrieval accuracy.
- Check for model fallback: watch for banners in busy periods that signal a reduced context window, especially on free or Plus accounts. Staying aware of the active model ensures you can react quickly if an auto-downgrade disrupts your workflow.
- Lean on enterprise guarantees: Enterprise users benefit from pinned models and context windows even under heavy load. Teams operating at scale should work with their OpenAI account managers to define custom retention and memory settings where possible.
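The points above can be rolled into a single pre-flight check. A sketch under this guide's assumptions: published windows from the comparison table, a 1,000-token overhead buffer, and the 8,000-token in-app reply cap.

```python
# Window sizes from the comparison table above (article figures, not an API).
WINDOWS = {
    "gpt-5-fast": 128_000,
    "gpt-5-thinking": 196_000,
    "o3": 200_000,
    "gpt-4.1-chat": 32_000,
}
OVERHEAD = 1_000    # buffer for system instructions, routing, safety logic
REPLY_CAP = 8_000   # ceiling on a single in-app answer

def fits(model: str, prompt_tokens: int, file_tokens: int = 0) -> bool:
    """True if the planned input leaves room for overhead and a full reply."""
    budget = WINDOWS[model] - OVERHEAD - REPLY_CAP
    return prompt_tokens + file_tokens <= budget

print(fits("gpt-5-fast", 90_000, 25_000))    # True: 115,000 <= 119,000
print(fits("gpt-4.1-chat", 20_000, 15_000))  # False: plan to split or trim
```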
How to choose the right model for your workload.
| Use case | Recommended model/mode |
| --- | --- |
| Quick summaries and simple drafting | GPT-5 Fast (128,000) |
| Legal reviews and multi-section research | GPT-5 Thinking (196,000) or o3 |
| Book-length input via API | GPT-5 API (400,000) or GPT-4.1 API (1,000,000) |
| Heavy upload and document analysis | o3 / o4-mini in chat |
| Cost-sensitive, short-context summaries | GPT-4o (legacy) |
Selecting the correct model isn’t just about context size. Consider your project’s complexity, collaboration needs, and the type of files you routinely upload. If you regularly push the limits of ChatGPT’s memory, it may be time to experiment with API-based workflows, which provide both the largest windows and the greatest customisation.
Understanding and working within these updated token and context limits ensures that ChatGPT remains accurate, context-aware, and capable of tackling extended, multi-part tasks. By choosing the right model, planning uploads for the available memory window, and watching for system overhead, you can keep your workflow smooth and your results as complete as possible—no matter how complex your conversation or file input.