
ChatGPT — Context Window, Token Limits, and Memory: how session recall and long input handling work

ChatGPT’s performance in extended tasks—code review, document analysis, conversation recall—depends on how it manages context windows, token limits, and its new memory system. These three layers define what the model can read, remember, and reason about at any given time. Understanding their interaction helps users build workflows that stay coherent across long chats or large file inputs.

·····

The structure of ChatGPT’s context and memory systems.

Every model version in ChatGPT has two types of storage:

  1. Context window: a temporary workspace (in tokens) that holds the current conversation, recent turns, and uploaded files. It is erased when the session ends.

  2. Persistent memory: a long-term recall layer that remembers facts, preferences, and relationships across sessions—available only in supported regions and accounts.

Tokens represent chunks of text (≈4 characters on average). Both the input and output consume tokens from the same window. When that limit is reached, older content is dropped or summarized internally to fit within the cap.
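To see the token math concretely, you can count tokens locally with OpenAI's open-source tiktoken library before sending anything. A minimal sketch, assuming the o200k_base encoding used by GPT-4o-family models:

```python
# pip install tiktoken
import tiktoken

# "o200k_base" is the encoding used by the GPT-4o family;
# GPT-3.5 Turbo and GPT-4 use "cl100k_base".
enc = tiktoken.get_encoding("o200k_base")

text = "ChatGPT counts context in tokens, not characters."
tokens = enc.encode(text)

print(f"{len(text)} characters -> {len(tokens)} tokens")
# Expect roughly 4 characters (about 0.75 words) per token.
```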

·····

Context window sizes across models.

| Model | Context window | Approx. word capacity | Notes |
| --- | --- | --- | --- |
| GPT-3.5 Turbo (Free) | ~16,000 tokens | ≈12,000 words | Default for free-tier users |
| GPT-4o (Plus / Team / Enterprise) | 128,000 tokens | ≈96,000 words | Handles long docs and complex reasoning |
| GPT-4o-mini | 128,000 tokens | ≈96,000 words | Lightweight, cheaper, fast variant |
| GPT-4-Turbo (Legacy) | 128,000 tokens | ≈96,000 words | Being replaced by GPT-4o |
| GPT-4 / o3 / o3-pro (API) | 128,000+ tokens | ≈96,000+ words | API access for developers; scalable quotas |

1 token ≈ 0.75 words, so a 128k window can hold roughly a 200-page book or dozens of combined files if properly chunked. Memory, by contrast, is unlimited in duration but highly selective in what it retains.
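That arithmetic can be wired into a quick fit check. A rough sketch based on the 0.75 words-per-token rule of thumb, with window sizes mirroring the table above:

```python
# Rough fit check. The 0.75 words-per-token ratio is the article's
# rule of thumb, not an exact tokenizer measurement.
WINDOWS = {
    "gpt-3.5-turbo": 16_000,
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
}

def estimated_tokens(word_count: int) -> int:
    # 1 token ~ 0.75 words, so tokens ~ words / 0.75
    return round(word_count / 0.75)

def fits(word_count: int, model: str, reply_budget: int = 2_000) -> bool:
    # Input and output share the window, so reserve room for the reply.
    return estimated_tokens(word_count) + reply_budget <= WINDOWS[model]

print(fits(12_000, "gpt-3.5-turbo"))  # 16,000 + 2,000 > 16,000 -> False
print(fits(90_000, "gpt-4o"))         # 120,000 + 2,000 <= 128,000 -> True
```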

·····

How ChatGPT trims and compresses context.

When your running conversation exceeds the window size, the model starts to summarize older turns and drop fine detail. This is automatic; you can’t disable it. Typical symptoms include:

  • Losing reference to details from early in the thread.

  • Reinterpreting instructions in slightly new ways after long exchanges.

  • Summarized versions of prior answers appearing in later reasoning.

Best practice: treat long workflows as multi-stage sessions. Use exported summaries, pinned notes, or memory features rather than a single endless chat thread.
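One way to run such a multi-stage session programmatically is to keep a pinned summary and carry it forward instead of the full transcript. A minimal sketch with the official openai Python SDK; the prompts and model choices are illustrative, not ChatGPT's internal mechanism:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def compress_history(turns: list[dict]) -> str:
    """Distill older turns into a short pinned note."""
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Summarize this conversation in under 200 words, "
                        "keeping decisions, facts, and open questions."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

def next_turn(summary: str, recent: list[dict], user_msg: str) -> str:
    """Send the pinned summary plus recent turns, not the full history."""
    messages = [{"role": "system", "content": f"Context so far: {summary}"}]
    messages += recent
    messages.append({"role": "user", "content": user_msg})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```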

·····

How file length interacts with the context window.

When you upload a file, ChatGPT doesn’t ingest the entire document blindly—it chunks and indexes it. The chunked text counts against the same token window. A 100-page PDF (~40,000 tokens) plus your prompt (2,000) plus the reply (2,000) consumes ~44,000 tokens—safe for GPT-4o but over the limit for GPT-3.5.

Rule of thumb:

  • Under 15k tokens → fine for all models.

  • 15k–100k tokens → use GPT-4o or 4o-mini.

  • Over 100k tokens → split files or summarize sections manually.
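For the third case, a simple token-based splitter is often enough. A sketch using tiktoken; the 15k-per-chunk budget echoes the rule of thumb above, and annual_report.txt is a hypothetical file:

```python
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 15_000,
                    encoding: str = "o200k_base") -> list[str]:
    """Split text into pieces that each fit a per-chunk token budget."""
    enc = tiktoken.get_encoding(encoding)
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + max_tokens])
            for i in range(0, len(ids), max_tokens)]

# Hypothetical input: a report too large for one request becomes
# several <=15k-token chunks that can be summarized one at a time.
chunks = chunk_by_tokens(open("annual_report.txt").read())
print(f"{len(chunks)} chunks")
```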

·····

Persistent memory and how it differs from context.

The new ChatGPT Memory stores small, structured pieces of information: facts about you, your projects, tone preferences, and corrections. It is not transcript storage; it is a semantic profile updated over time.

| Feature | Context window | Memory |
| --- | --- | --- |
| Scope | One conversation | Across sessions |
| Capacity | 128k tokens max | Small, structured summaries |
| Editability | Automatic / temporary | User-viewable, can be cleared |
| Use case | File reading, reasoning | Personalized assistance |

You can review or delete stored memory anytime in the Settings → Personalization → Memory menu.

·····

Token budgeting for long tasks.

| Task type | Typical token use | Recommended model |
| --- | --- | --- |
| Email / short draft | 1k–2k | GPT-3.5 Turbo |
| Article or blog synthesis | 5k–15k | GPT-4o-mini |
| Multi-file research | 20k–60k | GPT-4o |
| Book or large codebase analysis | 80k–120k | GPT-4o (Team/Enterprise) |

When you reach upper bounds, structure prompts as modular subtasks:

  1. Summarize or extract key sections first.

  2. Store those summaries locally.

  3. Ask a final model call to synthesize them.

This staged pattern avoids token overflow and preserves reasoning continuity, as shown in the sketch below.
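A minimal sketch of that three-step pattern with the openai Python SDK (the section texts and prompts are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def summarize(section: str) -> str:
    """Step 1: compress each section independently with a cheap model."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Summarize the key points:\n\n{section}"}],
    )
    return resp.choices[0].message.content

# Step 2: store the intermediate summaries locally (placeholder sections).
sections = ["...section 1 text...", "...section 2 text..."]
summaries = [summarize(s) for s in sections]

# Step 3: one final call synthesizes the stored summaries, so no single
# request ever has to carry the full source text.
final = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Synthesize these summaries into one brief:\n\n"
                          + "\n\n".join(summaries)}],
)
print(final.choices[0].message.content)
```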

·····

Technical behavior when limits are exceeded.

If a request exceeds the available window:

  • In chat: ChatGPT automatically truncates earlier turns.

  • In API calls: the request fails with a 400 error whose code is context_length_exceeded.

  • In voice sessions: The conversation resets silently to preserve speed.

Internally, the model still holds a rolling buffer of the most recent ~128k tokens.
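On the API side, that 400 can be caught and handled by trimming the oldest turns yourself, roughly mirroring what the chat interface does automatically. A hedged sketch with the official Python SDK:

```python
import openai
from openai import OpenAI

client = OpenAI()

def ask_with_trimming(messages: list[dict], model: str = "gpt-4o") -> str:
    """Drop the oldest non-system turn and retry whenever the request
    is rejected for exceeding the context window."""
    while True:
        try:
            resp = client.chat.completions.create(model=model, messages=messages)
            return resp.choices[0].message.content
        except openai.BadRequestError as err:
            if "context_length_exceeded" not in str(err) or len(messages) <= 2:
                raise  # a different 400, or nothing left to trim
            messages = [messages[0]] + messages[2:]  # keep the system prompt
```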

·····

Memory in team and enterprise environments.

Team and Enterprise editions support shared memory policies, letting admins control:

  • Data retention (organization-only or per-user).

  • Audit visibility of memory updates.

  • Opt-out flags for confidential projects.

For regulated environments, teams can disable personal memory while still benefiting from session context. Enterprise accounts store memory in tenant-isolated environments that comply with SOC 2 and ISO 27001 standards.

·····

Example: token planning in practice.

Scenario: You upload a 60-page annual report (~25k tokens) and ask:

“Summarize financial highlights and extract key metrics for a slide.”

  • Prompt: 1,000 tokens

  • File text: 25,000 tokens

  • Model reply: 2,000 tokens


Total: 28,000 tokens → easily fits GPT-4o (128k).

A follow-up like “Compare this to last year’s report (50 pages)” adds ~20,000 more, for ~48,000 in total. Still comfortably under 128k, but past a third of the window; expect slower processing and minor compression of older turns.
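The same budget can be tracked in a few lines of arithmetic (all figures are the article's estimates, not measured token counts):

```python
WINDOW = 128_000  # GPT-4o

turn_1 = 1_000 + 25_000 + 2_000   # prompt + report + reply = 28,000
turn_2 = turn_1 + 20_000          # second report adds ~20,000 more

print(f"{turn_2:,} tokens = {turn_2 / WINDOW:.0%} of the window")
# -> 48,000 tokens = 38% of the window
```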

·····

Cost and performance impact (API view).

API usage is billed per input and output token. When operating near the 128k window, expect longer latency and higher cost. For developers:

  • Use retrieval patterns (embedding + search) for repeat queries.

  • Cache intermediate summaries instead of re-feeding the same document.

  • Limit max_output_tokens to practical needs (e.g., 1k–2k) to control expense.
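Illustrating the last point: in the official Python SDK the output cap is max_tokens on Chat Completions and max_output_tokens on the newer Responses API. A minimal sketch, assuming a recent SDK version:

```python
from openai import OpenAI

client = OpenAI()

# Chat Completions: cap the reply to bound output-token cost.
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the quarterly notes."}],
    max_tokens=1_000,
)

# Responses API equivalent, using the parameter named in the list above.
resp2 = client.responses.create(
    model="gpt-4o",
    input="Summarize the quarterly notes.",
    max_output_tokens=1_000,
)
```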

·····

Practical guidelines for creators and analysts.

  1. Split long workflows. One session per deliverable or phase.

  2. Use memory intentionally. Let the model recall preferences, not whole projects.

  3. Budget tokens. Large docs plus verbose replies can silently overflow limits.

  4. Save summaries externally. Treat memory as metadata, not archive.

  5. Switch models as needed. GPT-4o for deep work, 3.5 Turbo for drafts.

  6. Reset occasionally. Long sessions accumulate summarization drift—fresh threads restore precision.

·····

Quick reference table.

| Parameter | GPT-3.5 Turbo | GPT-4o / 4o-mini | Team / Enterprise |
| --- | --- | --- | --- |
| Context limit | 16k | 128k | 128k+ |
| Persistent memory | | | ✅ (managed) |
| File upload cap | ~20 MB total | ~200 MB total | ~500 MB total |
| Session recall | Single chat | Multi-turn + memory | Organization policy |
| Best use | Short tasks | Long analysis, projects | Shared work environments |

·····

The bottom line.

ChatGPT manages three layers of recall: immediate context (up to 128k tokens), summarized compression for long conversations, and persistent memory for personal continuity. Treat the context window as a workspace and memory as a notebook of key facts, not a database. By budgeting tokens and segmenting tasks, you can maintain accuracy, speed, and reliability even in the largest projects.
