ChatGPT File Upload and Reading Capabilities: Full Report on File Types, Supported Formats, Processing Methods, Practical Applications, Use Cases, Limitations, and Technical Insights
- Graziano Stefanelli
Let's explore how ChatGPT ingests, parses, and processes different file types—including documents, spreadsheets, images, presentations, and code—covering the full range of supported formats, the underlying technical architecture, memory and tokenization strategies, best practices, current limitations, and advanced configuration across web, API, and custom GPT workflows.

ChatGPT allows users to upload various file types so the model can “read” and process their content. This turns the chatbot from a simple Q-and-A bot into a powerful analysis assistant for your own documents, spreadsheets, images, and code.
Supported formats include text documents (PDF, Word, TXT, Markdown, RTF, etc.), spreadsheets (Excel .xlsx or CSV), slide decks (PowerPoint), images (PNG, JPEG, GIF, etc.), and code or data files (Python, JavaScript, JSON, etc.).
In practice, you can attach multiple files (up to 10 at once in a chat) and ask ChatGPT to summarize, analyze, or transform their contents.

________________
INTRODUCTORY TABLE
Category | Extensions / Formats | Capabilities (ChatGPT can…)
Documents | PDF, .docx, .txt, .md, .rtf, .tex | extract and summarize text; answer questions about content; translate or reformat text. (Scanned PDFs or images require OCR.)
Spreadsheets | .xlsx, .xls, .csv | analyze data with the Advanced Data Analysis mode: run calculations, create tables or charts, find patterns, and explain results.
Presentations | .pptx | extract and summarize text from slides; provide feedback on structure or style. (Embedded images or charts are generally ignored outside Enterprise.)
Images | .png, .jpg / .jpeg, .gif, .bmp, .webp | describe image content and perform OCR on text within images. (For PDFs with images, only Enterprise can analyze charts or figures.)
Code Files | .py, .js, .java, .cpp, .cs, .html, .css, .ts, .sh, .php, .rb, etc. | review, explain, or debug code; convert between languages; suggest improvements. (All code must be in a text encoding such as UTF-8.)
Data / JSON | .json | parse structured data; convert to tables or CSV; analyze schema or contents (often via the Data Analysis mode).
_________________
1 Uploading and Using Files

In the ChatGPT web or mobile app, you’ll see a file-attachment icon (the “+” button, then “Add photos and files”) in the chat interface. Click it to select and upload a file from your device. After uploading, the file appears in the chat history (often as a link or file preview). You can then ask ChatGPT questions about that file, for example, “Please summarize the attached PDF,” or “Using the attached Excel sheet, what was the highest sales month?” ChatGPT will process the file content and respond accordingly. You can even upload multiple files in one conversation (up to ten) to have the model compare or cross-reference them.
Free-tier users now have limited file-upload access (approximately three files per day, with strict caps). ChatGPT Plus subscribers (USD 20 / month) can upload many more files (officially up to eighty every three hours on GPT-4o) and enjoy faster, prioritized processing. The ChatGPT Pro tier (USD 200 / month) and Enterprise accounts offer even higher or “unlimited” quotas. When uploading, ensure you’re using a GPT-4-class model (sometimes labeled Advanced Data Analysis or GPT-4V / Vision); older models don’t have file tools. ChatGPT uploads work on web, iOS, and Android; programmatic uploads are available through the Assistants API (see Section 8).
_________________
2 Core Capabilities
Once a file is uploaded, ChatGPT can perform many tasks on its contents. For text documents it can read and summarize the text, answer questions about it, rewrite it, translate it, or extract key points. You can upload a long legal contract and ask ChatGPT to highlight all clauses about termination, or feed it a research-paper PDF and ask for a plain-language summary. The model can search within the text for specific topics (for example, “Find all mentions of liability in the attached PDF.”) and pull out structured data such as dates, names, or figures. If the file contains tables (like a PDF financial report), ChatGPT can interpret the tables, though very complex or unusual layouts may confuse it.
Spreadsheets and CSV files get special handling through Advanced Data Analysis: ChatGPT loads the data into a Python-like environment, examines the columns, and can run code to compute statistics or generate charts. You might ask it to calculate sums, averages, or plot quarterly revenue. ChatGPT can process multiple CSV / XLSX files together and identify patterns across them.
For images, the vision model can analyze visual content. Upload a photo or diagram and ChatGPT can describe the image, read any text (OCR), and answer questions. Stand-alone image files (PNG, JPG, etc.) use GPT-4’s vision capability, whereas text PDFs / Word docs rely on text extraction.
ChatGPT also handles code and structured data files intelligently. Upload a .py, .js, .java, or similar file and ask the model to explain what it does, identify bugs, or suggest improvements. Similarly, JSON or other data-format files can be parsed and transformed.
_________________
3 Example Use Cases
Financial analysis: Upload an Excel sheet of quarterly earnings and ask ChatGPT to compute key ratios, highlight outliers, or generate revenue-vs-expense charts. You could also upload a PDF earnings report and have ChatGPT summarize the conclusions.
Research and education: Feed in scientific papers or textbooks (PDFs) and request summaries, highlights of novel contributions, or explanations of complex passages. Teachers can upload lecture notes or problem sets and ask ChatGPT to generate practice questions or simpler explanations.
Legal and compliance: Upload contracts or policy documents to extract important clauses, such as penalties for early termination, or compare two versions of a legal document and highlight differences.
Code development: Developers can upload code files or snippets and ask, “What does this function do? Find errors.” ChatGPT can walk through the logic, suggest fixes, or even write additional code.
General insights: Business users might upload market-research PDFs and ask for executive summaries; HR teams can upload resumes or job descriptions and have ChatGPT extract key skills or rewrite them in a different style.
_________________
4 Limitations
File size and length are capped: each file can be at most 512 MB, with stricter limits for certain types (text documents at roughly 2 million tokens, spreadsheets at around 50 MB, images at 20 MB). Very large combined content may exceed the model’s context window, so excess text could be ignored. There are also total storage and usage caps; free users have very limited daily quotas, and even paid users face per-user or per-organization storage limits.
Some formats aren’t supported. ChatGPT cannot process video or audio files, archives like ZIP / RAR, or password-protected documents. Google Docs links must be exported to Word or PDF first. Images embedded within documents are ignored (except on Enterprise); only raw text is read. Accuracy is another limitation: the AI may misread complex tables or OCR text, or hallucinate content that isn’t present. Always verify critical facts or calculations against the source.
Uploaded files live only in the current chat session. If you start a new chat or switch models, the model “forgets” the file. (Custom bots can upload persistent knowledge files, but that’s an advanced workflow.)
_________________
5 Tips for Best Results
Prepare your content and prompts carefully. Reference the file explicitly: “In the attached PDF report, what were last year’s total expenses?” If a document has multiple sections, first ask ChatGPT to outline the structure, then query a specific section by its title. For data files, ensure columns are well-labeled and data is clean. Use high-quality scans for OCR.
For complex tasks, break your work into steps—identify relevant parts, then have ChatGPT perform calculations or write a summary. You can also ask it to verify its own work: “Double-check that the total you found matches the sum of column B.” Always treat ChatGPT’s output as a draft or assistant; for critical tasks, have a human expert review the results.
_________________
6 Processing of Major File Types
Text Documents (Word, PDF, TXT, etc.)
ChatGPT supports text-based documents (e.g. .txt, .md, .docx, text-based PDF). Upon upload, these files are parsed by a document-extraction pipeline. For PDFs, ChatGPT typically uses a PDF text extractor (similar to libraries like PyPDF2 or MuPDF) to pull out all digital text; scanned pages or embedded images are not read except in Enterprise Vision Retrieval mode. Word (.docx) or text files are similarly converted to raw text. That extracted text is subject to a hard cap of 2 million tokens per file. Once extracted, the text is placed into the model context as usual.
Tokenization: The extracted text is split into tokens by the GPT tokenizer (which handles prose, markup, and some rich text like lists). The tokenizer’s large byte-pair-encoding vocabulary covers punctuation, code symbols, and common words efficiently.
Large documents: If a document is extremely long, ChatGPT may internally chunk it. For example, custom GPTs with knowledge retrieval automatically break files into semantic chunks and embed them in a vector store. Even in ordinary chats, very long texts will be handled by summarization or retrieval under the hood to stay within context limits.
Preprocessing: Minimal additional preprocessing is applied (aside from text extraction). Structural elements like headings, lists or table layouts remain as plain text (GPT will see “Table:” or bullet points if extracted). ChatGPT does not preserve PDF layout or images in normal plans.
Model pipeline: This text goes into the same language model pipeline as user messages. For GPT-4o, it uses the 128K-token context. If the Advanced Data Analysis (ADA) tool is enabled, ChatGPT might also treat very large tables or semi-structured text via Python (see below), but simple document summarization or Q&A is done by the language model alone.
In summary, text documents are converted to plaintext tokens and processed by GPT-4o’s text understanding. Formatting cues (like lists or simple tables) may be recognized and reproduced, but complex formatting (fonts, exact layout) is not preserved.
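OpenAI has not published its internal extraction code, but the flow can be approximated with open-source tools. Here is a minimal sketch using pypdf for extraction and tiktoken for token counting (report.pdf is a placeholder filename; o200k_base is GPT-4o’s public tokenizer vocabulary):

```python
from pypdf import PdfReader  # pip install pypdf
import tiktoken              # pip install tiktoken

# Pull the digital text layer out of each page (scanned pages yield nothing,
# mirroring how ChatGPT skips images outside Enterprise Vision Retrieval)
reader = PdfReader("report.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Count tokens against the 2-million-token per-file cap
enc = tiktoken.get_encoding("o200k_base")
n_tokens = len(enc.encode(text))
print(f"{len(reader.pages)} pages -> {n_tokens} tokens (cap: 2,000,000 per file)")
```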
Spreadsheets (CSV, Excel)
Spreadsheets and CSVs are treated as structured tabular data. When uploaded, ChatGPT’s code-interpreter/ADA tool usually takes over:
Extraction: The file is parsed using standard data libraries (e.g. Python’s csv or pandas.read_excel). The first few rows may be previewed to infer column types.
Tokenization: While the raw content of a CSV is text, ChatGPT does not simply dump millions of rows into the context. Instead, as explained in the Data Analysis guide, GPT-4 writes and executes Python code to answer queries about the data. In practice, this means the file is loaded into an internal Python environment (the “kernel”), and GPT issues code (via the ADA tool) to manipulate the data.
Structured data handling: The model interprets columns, types, and rows. It can perform computations, summary statistics, or create plots. The assistant might either respond in text (“The average sales is…”) or return results as downloadable CSV/Excel if requested.
Formatting and charts: If the user asks for a chart, the ADA tool can generate plots (matplotlib, etc.) and return them as images. The original spreadsheet’s charts, however, are not automatically preserved; only the underlying data is used.
Token limits: Spreadsheets have a file-size cap (~50 MB) but no strict token cap, since the data lives in the Python tool environment rather than the language context. Still, extremely large tables may cause memory/timeouts; in practice, several million cells might be handled if they fit in memory.
Thus, spreadsheets are largely handled by the ADA “Python kernel”: ChatGPT treats the data as code-executable content rather than raw text. This allows maintaining structure (e.g. tables remain tables internally) and performing reliable computations. The user sees responses incorporating the data, and can request further analysis iteratively.
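For intuition, the code the model writes inside the ADA kernel typically looks like ordinary pandas. A sketch under assumed inputs (sales.xlsx with hypothetical month and revenue columns):

```python
import pandas as pd

# Load the uploaded spreadsheet (reading .xlsx requires openpyxl)
df = pd.read_excel("sales.xlsx")   # or pd.read_csv("sales.csv")
print(df.dtypes)                   # preview the inferred column types

# Answer a question like "what was the highest sales month?"
monthly = df.groupby("month")["revenue"].sum()
print(monthly.idxmax(), monthly.max())
```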
Presentations (PowerPoint, Google Slides)
Presentation files (e.g. .pptx) are processed similarly to documents:
Text extraction: The text content from slides (titles, bullet points, notes) is extracted (e.g. via python-pptx or similar). Non-text elements (images, embedded charts) are generally ignored unless Enterprise Vision Retrieval is enabled.
Tokenization: Extracted slide text is tokenized as normal. ChatGPT preserves list/bullet structure (for example, bullets become - or 1. items in output). But intricate formatting (font styles, slide layout) is lost.
Processing: The model can analyze or summarize the slide content. For example, it can critique a slide deck’s content or convert slides into a narrative document.
Data/charts: Embedded charts or images on slides are not processed under normal modes (only their caption or title text might be read). Users would need to separately upload images if analysis of a chart is needed.
Presentations are effectively treated as multi-page text documents; OpenAI’s file-upload capability explicitly calls out turning a presentation into a document as a use case. In all cases the final input to the model is tokenized text.
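Since the text above names python-pptx as the kind of library involved, here is a minimal sketch of that extraction step (deck.pptx is a placeholder; the actual internal pipeline is not public):

```python
from pptx import Presentation  # pip install python-pptx

prs = Presentation("deck.pptx")
for i, slide in enumerate(prs.slides, start=1):
    # Keep only textual shapes; images and embedded charts are skipped,
    # mirroring how non-text slide elements are ignored outside Enterprise
    texts = [s.text for s in slide.shapes if s.has_text_frame and s.text]
    print(f"Slide {i}: " + " | ".join(texts))
```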
Images (JPEG, PNG, GIF, etc.)
ChatGPT’s image handling is powered by its vision encoder (GPT-4o Vision). When an image file is uploaded (e.g. via the image upload button), the following occurs:
Preprocessing: The image is resized if needed (e.g. smallest side to 768 pixels, as ChatGPT web does) and divided into tiles (typically 512×512 pixels tiles).
Tokenization: Each tile is fed through the vision encoder, which converts visual features into a series of tokens (vector embeddings). In GPT-4o, each 512×512 tile effectively costs 170 tokens (plus an 85-token overhead for a low-res thumbnail), roughly the token budget of 130 English words per tile.
Model pipeline: The sequence of image tokens is then concatenated with any textual prompt and fed into GPT-4o’s transformer layers. GPT-4o has been trained to align visual features with language, so it can describe, analyze, or reason about the image content in text. For example, it can caption the scene, read text from the image, identify objects, or interpret diagrams.
Vision features: GPT-4o’s vision can perform OCR on visible text in images (e.g. reading a screenshot or sign), recognize objects and scenes, infer simple math or diagrams, etc. However, text in images may be imperfectly recognized if the resolution or quality is low. Complex layouts (e.g. a detailed chart) will be interpreted as best it can, but the output is natural-language descriptions.
Limitations: Non-textual details such as subtle colors or intricate graphics are only indirectly conveyed. If exact visual fidelity is needed (e.g. “What are the exact values of this chart?”), GPT-4o’s output may be approximate. The chat interface can also display the image so the user sees it, but the assistant’s responses are text descriptions. (ChatGPT Vision in Enterprise can annotate PDF images via Visual Retrieval, but standard ChatGPT discards images embedded in PDFs.)
In summary, uploaded image files are processed by GPT-4o’s vision encoder: they become vision-language tokens and are passed through the model alongside any textual prompt. No textual equivalent is created unless the image is OCR’d. The result is akin to “the model sees the image and answers as if describing it in words”.
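The tile-based token accounting above can be reproduced from OpenAI’s published pricing formula for high-detail vision input. A sketch (assuming the standard path: fit within 2048×2048, rescale the short side to 768 px, then count 512-px tiles):

```python
import math

def image_token_cost(width: int, height: int) -> int:
    # Fit within a 2048x2048 square
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # Rescale so the shortest side is 768 px
    scale = 768 / min(width, height)
    width, height = int(width * scale), int(height * scale)
    # 85 base tokens for the low-res thumbnail, 170 per 512x512 tile
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(image_token_cost(1024, 1024))  # 4 tiles -> 85 + 4*170 = 765 tokens
```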
Code Files (Python, JS, etc.)
Source code files are treated as text by default, but with some special considerations:
Plain-text parsing: A code file (e.g. .py, .js, .html) is extracted as text. The code’s content is tokenized by the same GPT tokenizer, which includes many programming language tokens. The model has been trained on code, so it can usually understand syntax and semantics.
Language awareness: When analyzing code text, ChatGPT’s language model switches to a more code-savvy mode if prompted (e.g. it recognizes indentation, keywords, etc.). It can perform code review, generate explanations, or even write code diffs.
Execution (ADA): If the Code Interpreter (Advanced Data Analysis) tool is enabled, ChatGPT can execute code. For example, uploading a .py file could allow the assistant to load and run it (via the Python sandbox) when responding. This can help with debugging or data analysis scripts. If no execution is needed, it still reasons about the code logically.
Tokenization: Code is tokenized similarly to text, but code tokens (identifiers, symbols) often map to multiple sub-tokens. Long code files can quickly hit token limits; the 128K context of GPT-4o is shared among all messages, so extremely long code may need trimming or summarization.
Formatting fidelity: Code structure (indentation, line breaks) is preserved in text. The assistant will usually output code in markdown code blocks if asked. However, binary or compiled code (e.g. .exe, .class) cannot be read directly.
Thus, code files become part of the conversation context as code text. With ADA, ChatGPT may load them into an isolated environment and use them in analysis. Otherwise it uses them as context for the language model to discuss or transform.
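As a toy illustration of the load-and-run step (the real ADA sandbox is an isolated, containerized kernel, not a local call), one can execute an uploaded script in a fresh namespace and inspect what it defines:

```python
import runpy

# Execute the uploaded script in its own namespace;
# "uploaded_script.py" is a placeholder filename for this sketch
namespace = runpy.run_path("uploaded_script.py")
print(sorted(k for k in namespace if not k.startswith("__")))
```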
JSON and Data Files
Generic data files like JSON, XML, or custom data formats are handled like text or via code:
Parsing: ChatGPT can parse JSON or XML strings if asked. When you upload a .json file, it’s treated as a text document. GPT-4’s language understanding allows it to recognize JSON structure (brackets, keys, arrays). It can extract values, summarize contents, or convert formats.
In ADA: The Python tool can load JSON/CSV easily (e.g. json.load() or pandas.read_json). This gives precise access to structured fields for querying. The model might automatically use pandas if it senses tabular structure.
Tokenization: The raw JSON is tokenized like any other text, and its structural characters (brackets, quotes, commas) often become separate tokens, so JSON is token-dense. Very large JSON could hit token limits; in ADA mode the file may be read fully into memory and manipulated there instead of being tokenized in the prompt.
In effect, structured data formats are just text to the model, but are also ripe for programmatic processing via Python. ChatGPT can answer questions like “what is the value of key X in this JSON” either by reading tokens or by running code.
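A sketch of the kind of code the ADA tool might run for such a question (data.json is a placeholder and is assumed to contain a list of records):

```python
import json
import pandas as pd

with open("data.json") as f:
    records = json.load(f)

# Flatten nested records into a table for querying, then convert formats
df = pd.json_normalize(records)
print(df.head())
df.to_csv("data.csv", index=False)  # JSON -> CSV
```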
_________________
7 Structured Data and Formatting Fidelity
ChatGPT’s handling of tables, lists, and charts depends on the mode:
Tables and lists: Plain text tables or markdown lists in input are usually understood and echoed with similar structure. For spreadsheets, when needed, it may output data as CSV or markdown tables. In code outputs, tables often come out as markdown tables for readability. However, ChatGPT does not maintain layout; it may normalize spacing and alignment.
Charts and figures: ChatGPT itself cannot preserve visual charts from an uploaded file. If the user requests chart output, ADA can generate new charts (as static images) from data. If the user uploads an image of a chart, GPT-4o will describe it qualitatively (“a bar chart showing sales increases”) but cannot extract raw data values unless they are easily readable. The model outputs charts as images only when generating them via code (not returning the original chart image).
Plain text vs. OCR: If a chart or table is in an image or scanned PDF, GPT-4o’s vision may do some OCR (for text in the image) or pattern recognition (for bars/lines). But by default ChatGPT does not run OCR on scanned PDFs; it requires plain digital text. In Enterprise, Visual Retrieval can actually OCR and parse images in PDFs, but in Plus/Free, image-embedded text is ignored. Therefore, for scanned data one must often manually convert to text or use external OCR before uploading.
In short, textual structure is preserved as much as possible (lists stay lists, code stays in code blocks), but visual fidelity is limited to what the model can re-create or describe in text.
_________________
8 Web Interface vs. API vs. Custom GPTs
The behavior of file processing differs slightly depending on how ChatGPT is accessed:
Web ChatGPT (chat.openai.com): The user interface allows drag-and-drop or the “+” button to upload files. These files go through the internal ChatGPT pipeline (using GPT-4o if available) and appear in the chat history. Web Chat supports up to 20 files in a GPT’s configuration or 10 per conversation. Files in a conversation persist until the chat is deleted (with up to 30 days retention). ChatGPT Plus/Pro users can use all advanced tools (vision, ADA) directly in the web chat, whereas Free users have limited tool access (e.g. Free may default to GPT-4o mini with no file upload after limits).
ChatGPT Assistants API: Unlike the chat completion API, the Assistants API does support file uploads as of early 2024. The workflow is (1) upload file via POST /files with purpose=assistants; (2) create an assistant with retrieval tooling; (3) create a thread and attach file by file_ids to a message; (4) run the assistant. The assistant then automatically uses retrieval over that file as knowledge (via vector embeddings). For images, the Assistant API similarly lets you upload an image file and then reference its file_id in the message (rather than base64). In sum, the Assistants API treats files much like custom GPTs do: as indexed knowledge with RAG.
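A minimal sketch of that four-step flow with the Python SDK, as the API looked in early 2024 (note: the retrieval tool and per-message file_ids were later renamed to file_search and attachments in the v2 Assistants API):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# (1) Upload the file with purpose="assistants"
f = client.files.create(file=open("report.pdf", "rb"), purpose="assistants")

# (2) Create an assistant with the retrieval tool enabled
assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="Answer questions using the attached document.",
    tools=[{"type": "retrieval"}],
)

# (3) Create a thread and attach the file to a message via file_ids
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Summarize the key findings.",
    file_ids=[f.id],
)

# (4) Run the assistant; it retrieves over the file automatically
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
```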
Custom GPTs (GPT Builder): Users on Plus/Pro can build Custom GPTs and upload knowledge files to them. The platform automatically enables Retrieval Augmented Generation (RAG) on these files. This means the GPT breaks uploaded docs into chunks, embeds them, and at runtime retrieves relevant chunks to answer queries. Custom GPTs can use GPT-4o (the full model) or GPT-4o-mini (lighter) as their engine; to use file uploads/vision you must choose a model with advanced tools enabled. Files added as “knowledge” become a persistent part of that GPT (up to 20 files per GPT) and are retained until the GPT is deleted. The user can also choose to enable Code Interpreter for custom GPTs, allowing the GPT to execute code on the uploaded data.
In all cases, the underlying models (GPT-4o base, GPT-4o mini, etc.) remain the same or similar. The major differences are in tooling:
Web chat integrates ADA and vision seamlessly for the user.
Assistants API requires explicit steps (file upload, embedding retrieval tools, thread runs).
Custom GPTs have retrieval turned on by default for knowledge files.
Models: GPT-4o (128K context, multimodal) supports all file modes; GPT-4o mini (also 128K but limited tools) cannot use code interpreter or vision.
Finally, memory/storage: Files uploaded in a normal chat belong only to that chat. Once the chat or user account is deleted, files are removed within 30 days. In a custom GPT, files are part of the GPT’s “knowledge base” and are deleted only when the GPT is. Both web and API have user/org storage caps (10GB per user, 100GB per org).
_________________
9 Constraints and Limits
File size: All individual files are limited to 512 MB. In practice, most large documents (PDFs, Word) far exceed token limits long before this size; text and doc files hit a 2-million-token cap. Spreadsheets have a practical ~50 MB size limit (to avoid millions of rows). Images are capped at 20 MB.
Daily/Hourly quotas: Free users may only upload ~3 files per day, while GPT-4o users can upload up to 80 files per 3-hour window. These limits may be lowered during peak load.
Token context: GPT-4o has a 128K token context. Uploaded text counts against the context. For very large files, ChatGPT relies on RAG or ADA to avoid blowing through the context limit.
Model differences: Only the full GPT-4o model supports file uploads completely; GPT-4o mini and older GPT-3.5 models do not have data tools or vision. Thus, some file types (images, spreadsheets) require the full GPT-4o family.
Latency: File processing adds overhead. Large PDFs or many images can slow responses. ADA computations (especially data analysis) incur multi-second delays as Python code runs.
Output validation: ChatGPT’s answers may sometimes hallucinate from file content. It tends to quote or paraphrase loaded text, but always double-check critical outputs. The interface may show “>_” icons you can click to see the code the model ran.
_________________
10 Modules and Subsystems
Internally, ChatGPT invokes specialized modules per file type:
Vision Encoder (Image Model): For images and vision tasks, GPT-4o uses a frozen vision encoder (often likened to a CLIP-like network or a Flamingo-style perceiver) that turns pixels into token embeddings. This encoder processes the thumbnail (85 tokens) and tiles (170 tokens each) for each image. The resulting vectors feed into the language transformer. This is the same subsystem that powers GPT-4 Vision API.
Code Interpreter (Advanced Data Analysis): For spreadsheets and when ADA is enabled, ChatGPT spawns a secure Python execution environment. The model generates code (Python + libraries like pandas/numpy) to answer questions. The Code Interpreter both reads uploaded files (CSV, Excel, JSON, even zipped collections) and writes output files. The language model queries this environment in a loop (generate code → execute → parse result).
RAG/Vector Store: For long documents and knowledge files in custom GPTs, OpenAI uses semantic search. Each file is chunked (e.g. by paragraphs or logical segments) and embedded with OpenAI’s text embeddings. At query time, the user’s prompt is also embedded, and the most relevant chunks are retrieved and prefixed to the model’s input. This lets ChatGPT “search” within your files conceptually. (This retrieval happens automatically in GPT builder workflows; a minimal sketch follows this list.)
Tokenizer: All content—text, code, or text-extracted data—is ultimately tokenized by GPT-4o’s tokenizer (a Byte-Pair Encoding scheme). Images bypass the text tokenizer since they come in as vision tokens.
Memory (Reference Chat History): Separately from files, ChatGPT’s new memory feature (Plus/Pro only) can persist user preferences or facts across sessions. This is distinct from file uploads, but it influences how much context is available in conversation.
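As referenced in the RAG/Vector Store item above, here is a minimal sketch of the chunk-embed-retrieve loop (naive paragraph chunking; knowledge.txt and the query are placeholders, and OpenAI’s internal chunking is more sophisticated):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Chunk the document and embed each chunk once, up front
chunks = [c for c in open("knowledge.txt").read().split("\n\n") if c.strip()]
chunk_vecs = embed(chunks)

# At query time, embed the prompt and rank chunks by similarity
query = "What does the contract say about termination?"
q_vec = embed([query])[0]
scores = chunk_vecs @ q_vec          # OpenAI embeddings are unit-normalized
top = np.argsort(scores)[::-1][:3]   # three most relevant chunks

# These chunks would be prefixed to the model's input
context = "\n---\n".join(chunks[i] for i in top)
```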
_________________
11 Reliability and Edge Cases
Malformed files: If a file is corrupted or unsupported, ChatGPT will usually reply that it “cannot open” or “access” it. For example, a corrupted or image-only PDF may fail to parse. In some cases, splitting large files, or zipping parts of them for the Code Interpreter, helps bypass interface limits.
Poor scan / OCR failures: Low-quality images or PDFs with bad scans yield poor OCR; GPT-4o may misread text or say “I can’t read text clearly.” It's best to upload high-contrast, clear images for text extraction. Handwritten text is hit-or-miss.
Fallbacks: If ChatGPT’s vision model can’t confidently interpret something, it might ask clarifying questions or give a vague answer, for example, “the image is not clear” or “I see text but parts are blurry.” For documents with embedded images, the assistant may fall back to reading only alt-text or captions.
Content type recognition: ChatGPT generally auto-detects file type by extension. If you force a code file as .txt, it still works. Custom GPTs allow specifying “document” vs. “spreadsheet” vs. “presentation” which can nudge the model to use appropriate processing.
Output validation: The interface often highlights when it’s quoting file text. Users should verify quotations and data extractions. Code Interpreter typically returns actual values from the data, which tends to be reliable, but double-check statistical outputs.
Summary of Key Points
File parsing: Text/docs→text extraction; Spreadsheets→code-interpreter; Presentations→text extraction; Images→vision encoder; Code/JSON→text or code parsing.
Tools invoked: ADA Python kernel for data; Vision model for images; GPT’s language model for everything.
Tokenization: Visual content is converted to token embeddings (85 base tokens plus 170 per tile for each image); textual content uses GPT’s text tokens (with a 128K context max).
Structured data: Tables and lists are understood as such in text; spreadsheets stay structured in code form. Charts/figures are recreated (if generated) or described (if input as image).
OCR: Native OCR is only via GPT-4o Vision on images. Scanned docs without Enterprise Visual Retrieval yield no text from images.
Web vs API vs GPT: Web chat uses GPT-4o/4.1 and has built-in ADA/vision; API requires Assistants endpoints for files; custom GPTs use retrieval and specified models. Plus/Pro have more tools (ADA, memory, connectors) and higher quotas.
Limits: 512MB/file; 2M token cap on text docs; ~50MB on sheets; 20MB on images; per-user caps (10GB). Free users very limited (few files/day, possible GPT-4o mini fallback).
Modules: GPT-4o model, vision encoder, ADA Python, RAG vector store.