Microsoft Copilot & PDFs: A Technical Deep Dive

Copilot reads selectable-text PDFs, chunks them, indexes them, and answers with GPT-4 Turbo.

Edge, OneDrive, Copilot Chat, Copilot Studio, and Sustainability Manager each offer PDF features, with per-file limits ranging from 3 MB to 512 MB.

Small PDFs (<20 pages) yield better answers; scanned files need OCR first.

Data stays private via Microsoft Graph, Dataverse, and Azure AI Search.

Best practices: make sure the text is selectable, split big files, prompt clearly, and disable web search for confidential material.

Accessible-text gate check – Copilot will only ingest PDFs that already contain machine-readable text; image-only scans must be OCR’d first.
Parsing & normalisation – the PDF text stream is split into logical “chunks” (≈ 4k tokens, or roughly 2-3 pages, each) so that every chunk fits inside the GPT-4 Turbo context window while preserving section structure. Copilot’s public guidance caps effective context at ≈ 7 500 words for Q-and-A and ≈ 15 000 words for rewriting.
Semantic indexing – every chunk is embedded and stored in a Microsoft Graph-backed semantic index (the same index that powers Microsoft Search).
Retrieval-augmented generation (RAG) – a user prompt or follow-up query is expanded with graph signals (file name, author, sharing context), the most relevant PDF chunks are retrieved, and the final answer is produced by an Azure OpenAI GPT-4 Turbo endpoint grounded on the retrieved chunks.
Response streaming & chat memory – the answer is streamed back to the client. If enterprise data protection (EDP) is enabled, both prompt and response are retained under the tenant’s existing retention policies.
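The chunking step above can be illustrated with a minimal sketch. It is not Copilot's actual implementation: the function names are hypothetical, and the "4 tokens per 3 words" estimate is an assumed rule of thumb standing in for a real tokenizer. It packs whole paragraphs into chunks of at most about 4k tokens so that paragraph boundaries are preserved.

```python
import re

def approx_tokens(text: str) -> int:
    """Rough token estimate: assumes ~4 tokens per 3 words (not a real tokenizer)."""
    return len(text.split()) * 4 // 3

def chunk_paragraphs(text: str, max_tokens: int = 4000) -> list[str]:
    """Pack whole paragraphs into chunks that stay within max_tokens.

    A single paragraph larger than max_tokens is kept intact as its own
    oversized chunk; a production system would split it further.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in paragraphs:
        cost = approx_tokens(para)
        # Close the current chunk if adding this paragraph would overflow it.
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Packing by paragraph rather than by raw character count is one simple way to keep section structure inside each chunk, at the cost of slightly uneven chunk sizes.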

Where you invoke Copilot	What you can do with a PDF	Key limits	Notes / extras
OneDrive Web	Generate single- or multi-file summaries, then drill down with follow-up questions.	Any supported file type ≤ 10 MB (trial 1 MB) per upload; daily quota 2 GB.	Works without opening the document; supports batch mode.
Microsoft Edge sidepane	Read a PDF in the browser and ask Copilot to summarise, explain tables, translate passages, etc.	Rule of thumb: split files > 50 pages for faster answers.	Uses the same semantic index as M365, so browser context (URL, title) is automatically injected.
Copilot Chat (M365 app, web, mobile)	Drag-and-drop a PDF into chat, then ask analytic or transformation prompts.	Hard ceiling ≈ 1.5 M words / 3 000 pages; ideal size < 7 500 words for chat; per-file size ≤ 24 MB.	Uploaded files are stored in user-scoped OneDrive for Business and are never used for training.
Copilot Studio (custom agents)	Upload PDFs as knowledge sources; agents answer questions or power chatbots.	≤ 512 MB per PDF, ≤ 500 files; SharePoint connectors up to 200 MB per file.*	Files live in Dataverse; vector search + grounding handled automatically.
Sustainability Manager	ESG-specific “Document analysis” lets users upload up to 5 PDFs and interrogate them in natural language.	≤ 3 MB per file (preview).	Answers are stored alongside ESG metrics for traceability.

*SharePoint knowledge sources inherit the 200 MB limit unless Enhanced Search is enabled (then 512 MB).
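As a pre-flight check, the per-file ceilings in the table can be encoded in a small lookup. This is a sketch only: the surface keys and function names are hypothetical, the values are copied from the table above, and treating 1 MB as 1024 × 1024 bytes is an assumption (the services may count in decimal megabytes).

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Per-file ceilings in MB, copied from the table above (licensed tiers).
PER_FILE_LIMIT_MB = {
    "onedrive_web": 10,
    "copilot_chat": 24,
    "copilot_studio": 512,
    "studio_sharepoint": 200,  # 512 when Enhanced Search is enabled
    "sustainability_manager": 3,
}

def fits(surface: str, size_bytes: int) -> bool:
    """Return True if a file of size_bytes is within the surface's ceiling."""
    return size_bytes <= PER_FILE_LIMIT_MB[surface] * MB

def surfaces_accepting(size_bytes: int) -> list[str]:
    """List every surface whose per-file ceiling admits the given size."""
    return [name for name in PER_FILE_LIMIT_MB if fits(name, size_bytes)]
```

For example, `surfaces_accepting(5 * MB)` lists every surface except Sustainability Manager, whose preview limit is 3 MB.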

Context window realities – although the hard ceiling is ≈ 1.5 M words, response accuracy drops sharply beyond ≈ 15 000 words; splitting large PDFs into thematic sections yields better answers.
File-size ceilings – the Edge and OneDrive viewers themselves impose no strict MB limit, but the upload endpoints behind Copilot do (10 MB per file for most licensed tenants).
Daily throughput – OneDrive AI requests share a 2 GB daily ingestion pool per user; Studio uploads are capped at 500 files per agent.
Scans vs. text-PDFs – Copilot ignores bitmap-only pages; run OCR first to avoid silent omissions.
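To decide whether a document needs splitting before upload, the word budgets quoted earlier (≈ 7 500 words for Q-and-A, ≈ 15 000 for rewriting) can be checked against the extracted text. The sketch below assumes the text has already been extracted from the PDF by some other tool; the function names and the report keys are hypothetical.

```python
import math

QA_WORD_BUDGET = 7_500        # Q-and-A guidance quoted in this article
REWRITE_WORD_BUDGET = 15_000  # rewriting guidance quoted in this article

def parts_needed(word_count: int, budget: int = QA_WORD_BUDGET) -> int:
    """Smallest number of parts so that each part fits within the budget."""
    return max(1, math.ceil(word_count / budget))

def budget_report(text: str) -> dict:
    """Summarise how extracted PDF text compares with the two word budgets."""
    words = len(text.split())
    return {
        "words": words,
        "fits_qa": words <= QA_WORD_BUDGET,
        "fits_rewrite": words <= REWRITE_WORD_BUDGET,
        "qa_parts": parts_needed(words),
    }
```

A word count of zero for a multi-page file is also a useful warning sign that the PDF is image-only and needs OCR.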

Enterprise data protection (EDP) – encrypts prompts and completions at rest, aligns retention with Microsoft 365 policies, and blocks training use.
EU Data Boundary (EUDB) – for EEA tenants, inference traffic and storage stay inside EU regions; web-search calls can be disabled by policy.
Dataverse storage – files uploaded via Copilot Studio live in Dataverse file capacity (default 3 GB per environment) and are deletable by admins at any time.

Microsoft Graph supplies identity, sharing, and access-control context so the LLM can filter out content the user is not permitted to see.
Azure AI Search (vector + lexical) handles chunk retrieval with hybrid ranking before the prompt is sent to the model.
Azure OpenAI GPT-4 Turbo generates prose; multi-modal extensions (“Vision”) are planned to add in-PDF image understanding in a future update.
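The retrieval path described above (permission filtering, then hybrid ranking) can be sketched as follows. This is an illustration under stated assumptions, not Copilot's internal code: reciprocal rank fusion (RRF) is used here as one common way to merge a vector ranking with a lexical ranking, the constant k = 60 is a conventional default, and the function names and chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))

def retrieve(vector_hits: list[str], lexical_hits: list[str],
             allowed: set[str], top_n: int = 3) -> list[str]:
    """Drop chunks the user may not see, then fuse the two rankings."""
    trimmed = [[c for c in hits if c in allowed]
               for hits in (vector_hits, lexical_hits)]
    return reciprocal_rank_fusion(trimmed)[:top_n]
```

Filtering before fusion means a chunk the user cannot access never influences the final ordering and never reaches the prompt.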

▸ Make text selectable. OCR your scans.

▸ Stay under the sweet spots. < 20 pages for iterative chat; split anything longer.

▸ Use explicit, granular prompts. Ask for section-by-section analysis, tables only, etc.

▸ Leverage follow-up. Copilot keeps PDF context alive for the entire chat thread.

▸ Control exposure. Disable web search when handling confidential PDFs.

▸ For custom agents, pre-tag your files. Consistent filenames and SharePoint metadata improve retrieval ranking.
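The "split anything longer" tip can be planned before touching a PDF tool. The sketch below only computes 1-based, inclusive page ranges of at most 20 pages each; the function name is hypothetical, and the actual splitting would be done with whatever PDF utility you already use.

```python
def page_ranges(total_pages: int, max_pages: int = 20) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges of at most max_pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]
```

Fixed-size ranges are the simplest choice; where a document has clear chapters, splitting at chapter boundaries instead usually keeps related content together.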