Microsoft Copilot & PDFs: a Technical Deep Dive
- Graziano Stefanelli
- Apr 29

Copilot reads selectable-text PDFs, chunks them, indexes them, and answers with GPT-4 Turbo.
Edge, OneDrive, Chat, Studio, and Sustainability Manager offer PDF features with size limits.
Small PDFs (<20 pages) yield better answers; scanned files need OCR first.
Data stays private via Microsoft Graph, Dataverse, and Azure AI Search.
Best practices: ensure text access, split big files, prompt clearly, and disable web search.
1. How the Pipeline Works (end-to-end)
Accessible-text gate check – Copilot will only ingest PDFs that already contain machine-readable text; image-only scans must be OCR’d first.
Parsing & normalisation – the PDF stream is split into logical “chunks” (roughly 4 k tokens, or about 2-3 pages, each) so that every chunk fits inside the GPT-4 Turbo context window while preserving section structure. Copilot’s public guidance caps effective context at ≈ 7 500 words for Q-and-A and ≈ 15 000 words for rewriting.
Semantic indexing – every chunk is embedded and stored in a Microsoft Graph-backed semantic index (the same index that powers Microsoft Search).
Retrieval-augmented generation (RAG) – a user prompt or follow-up query is expanded with graph signals (file name, author, sharing context), the most relevant PDF chunks are retrieved, and the final answer is produced by an Azure OpenAI GPT-4 Turbo endpoint with on-your-data grounding.
Response streaming & chat memory – the answer is streamed back to the client and, if enterprise data protection (EDP) is enabled, both prompt and response are retained under the tenant’s existing retention policies.
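The chunk, index, and retrieve steps above can be sketched in a few lines of Python. This is only an illustration of the general RAG pattern, not Copilot's actual implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the 4/3 tokens-per-word ratio is a rough assumption for English text, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter

CHUNK_TOKEN_BUDGET = 4000  # the article's approximate chunk size


def estimate_tokens(text):
    """Rough token estimate, assuming about 4 tokens per 3 English words."""
    return math.ceil(len(text.split()) * 4 / 3)


def chunk_paragraphs(paragraphs, budget=CHUNK_TOKEN_BUDGET):
    """Greedily pack whole paragraphs into chunks that stay within the budget.

    A single paragraph larger than the budget becomes its own oversized chunk.
    """
    chunks, current, used = [], [], 0
    for para in paragraphs:
        cost = estimate_tokens(para)
        if current and used + cost > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: a term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=2):
    """Assemble a grounded prompt from the retrieved chunks."""
    context = "\n---\n".join(retrieve(query, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In a real deployment the embedding, index, and model call are all managed services; the sketch only shows why smaller, well-separated chunks make it easier for retrieval to pick the right passage.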
2. Platform-by-Platform Capabilities
Where you invoke Copilot | What you can do with a PDF | Key limits | Notes / extras |
OneDrive Web | Generate single- or multi-file summaries, then drill down with follow-up questions. | Any supported file type ≤ 10 MB (trial 1 MB) per upload; daily quota 2 GB. | Works without opening the document; supports batch mode. |
Microsoft Edge sidepane | Read a PDF in the browser and ask Copilot to summarise, explain tables, translate passages, etc. | Recommended practice: split files > 50 pages for faster answers. | Uses the same semantic index as M365, so browser context (URL, title) is automatically injected. |
Copilot Chat (M365 app, web, mobile) | Drag-and-drop a PDF into chat, then ask analytic or transformation prompts. | Hard ceiling ≈ 1.5 M words / 3 000 pages; ideal size < 7 500 words for chat; per-file size ≤ 24 MB. | Uploaded files are stored in user-scoped OneDrive for Business and are never used for training. |
Copilot Studio (custom agents) | Upload PDFs as knowledge sources; agents answer questions or power chatbots. | ≤ 512 MB per PDF, ≤ 500 files; SharePoint connectors up to 200 MB per file.* | Files live in Dataverse; vector search + grounding handled automatically. |
Sustainability Manager | ESG-specific “Document analysis” lets users upload up to 5 PDFs and interrogate them in natural language. | ≤ 3 MB per file (preview). | Answers are stored alongside ESG metrics for traceability. |
*SharePoint knowledge sources inherit the 200 MB limit unless Enhanced Search is enabled (then 512 MB).
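As a quick pre-flight check, the per-file limits from the table can be captured in a lookup. The sketch below is a convenience helper, not an official API: the dictionary keys are hypothetical labels, the numbers are simply the ones quoted above (and may change), and treating 1 MB as 1024 × 1024 bytes is an assumption. Edge is left out because the table gives it no MB limit.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Per-file upload limits in MB, as quoted in the table above.
PER_FILE_LIMIT_MB = {
    "onedrive_web": 10,
    "copilot_chat": 24,
    "copilot_studio": 512,
    "sharepoint_connector": 200,
    "sustainability_manager": 3,
}


def fits_surface(size_bytes, surface):
    """Return True if the file is within the quoted per-file limit."""
    return size_bytes <= PER_FILE_LIMIT_MB[surface] * MB


def surfaces_accepting(size_bytes):
    """List every surface whose quoted limit would accept the file."""
    return sorted(s for s in PER_FILE_LIMIT_MB if fits_surface(size_bytes, s))
```

Feeding it `os.path.getsize(path)` for a given PDF shows at a glance which entry points are worth trying before you upload.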
3. Practical Limits & Performance Guidance
Context window realities – although Word can theoretically summarise 1.5 M words, response accuracy drops sharply beyond 15 k words; splitting large PDFs into thematic sections yields better answers.
File-size ceilings – Edge and OneDrive do not impose a strict MB limit, but upload endpoints do (10 MB per file for most licensed tenants).
Daily throughput – OneDrive AI requests share a 2 GB daily ingestion pool per user; Studio uploads are capped at 500 files per agent.
Scans vs. text-PDFs – Copilot ignores bitmap-only pages; run OCR first to avoid silent omissions.
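The guidance in this section can be turned into a small audit. The sketch below assumes you have already extracted the text of each page with a PDF library of your choice and passes it in as a list of strings; the 20-character threshold for "this page has real text" is a hypothetical cut-off, and the word budgets are the article's figures.

```python
QA_WORD_BUDGET = 7_500        # article's figure for Q-and-A
REWRITE_WORD_BUDGET = 15_000  # article's figure for rewriting
MIN_CHARS_PER_PAGE = 20       # hypothetical threshold for a text-bearing page


def audit_pages(page_texts):
    """Summarise OCR needs and word-budget fit from per-page extracted text."""
    needs_ocr = [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < MIN_CHARS_PER_PAGE
    ]
    words = sum(len(text.split()) for text in page_texts)
    return {
        "pages_needing_ocr": needs_ocr,
        "total_words": words,
        "fits_qa": words <= QA_WORD_BUDGET,
        "fits_rewrite": words <= REWRITE_WORD_BUDGET,
        # Ceiling division: parts needed so each stays within the Q-and-A budget,
        # assuming the text can be split roughly evenly.
        "suggested_parts": max(1, -(-words // QA_WORD_BUDGET)),
    }
```

Pages flagged in `pages_needing_ocr` are the ones likely to be silently skipped, so they are worth fixing before anything else.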
4. Security, Compliance, Residency
Enterprise data protection (EDP) encrypts prompts & completions at rest, aligns retention with Microsoft 365 policies, and blocks training use.
EU Data Boundary (EUDB) – for EEA tenants, inference traffic and storage stay inside EU regions; web-search calls can be disabled by policy.
Dataverse storage – files uploaded via Copilot Studio live in Dataverse file capacity (default 3 GB per environment) and are deletable by admins at any time.
5. Inside the Box — Key Components
Microsoft Graph supplies identity, sharing, and access-control context so the LLM can filter out content the user is not permitted to see.
Azure AI Search (vector + lexical) handles chunk retrieval with hybrid ranking before the prompt is sent to the model.
Azure OpenAI GPT-4 Turbo generates prose; multi-modal extensions (“Vision”) are expected to add in-PDF image understanding in a future update.
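To make the interplay of these components concrete, here is a minimal sketch of permission trimming followed by hybrid ranking. Azure AI Search documents reciprocal rank fusion (RRF) as its method for merging vector and lexical results, but how Copilot configures it is not stated here, so treat the constant `k=60`, the ACL dictionary, and all function names as illustrative assumptions.

```python
def security_trim(doc_ids, acl, user):
    """Drop chunks the user may not read; acl maps chunk id -> set of users."""
    return [d for d in doc_ids if user in acl.get(d, set())]


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_retrieve(vector_ranking, lexical_ranking, acl, user, top_k=3):
    """Trim both rankings to what the user may see, then fuse them."""
    allowed_vec = security_trim(vector_ranking, acl, user)
    allowed_lex = security_trim(lexical_ranking, acl, user)
    return rrf_fuse([allowed_vec, allowed_lex])[:top_k]
```

Trimming before fusion means a chunk the user cannot access never influences the ranking and never reaches the prompt.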
6. Practitioner Checklist
▸ Make text selectable. OCR your scans.
▸ Stay under the sweet spots. < 20 pages for iterative chat; split anything longer.
▸ Use explicit, granular prompts. Ask for section-by-section analysis, tables only, etc.
▸ Leverage follow-up. Copilot keeps PDF context alive for the entire chat thread.
▸ Control exposure. Disable web search when handling confidential PDFs.
▸ For custom agents, pre-tag your files. Consistent filenames and SharePoint metadata improve retrieval ranking.

