
Microsoft Copilot & PDFs: a Technical Deep Dive

Copilot reads selectable-text PDFs, chunks them, indexes them, and answers with GPT-4-Turbo.
Edge, OneDrive, Chat, Studio, and Sustainability Manager offer PDF features with size limits.
Small PDFs (<20 pages) yield better answers; scanned files need OCR first.
Data stays private via Microsoft Graph, Dataverse, and Azure AI Search.
Best practices: ensure text access, split big files, prompt clearly, and disable web search.

1. How the Pipeline Works (end-to-end)

  1. Accessible-text gate check – Copilot will only ingest PDFs that already contain machine-readable text; image-only scans must be OCR’d first.

  2. Parsing & normalisation – the PDF stream is split into logical “chunks” (roughly 4 k tokens each, ≈ 2-3 pages) so that every chunk fits inside the GPT-4-Turbo context window while preserving section structure. Copilot’s public guidance caps effective context at ≈ 7 500 words for Q&A and ≈ 15 000 words for rewriting.

  3. Semantic indexing – every chunk is embedded and stored in a Microsoft Graph-backed semantic index (the same index that powers Microsoft Search).

  4. Retrieval-augmented generation (RAG) – a user prompt or follow-up query is expanded with graph signals (file name, author, sharing context), the most relevant PDF chunks are retrieved, and the final answer is produced by an Azure OpenAI GPT-4-Turbo endpoint with on-your-data grounding.

  5. Response streaming & chat memory – the answer is streamed back to the client and—if enterprise data protection (EDP) is enabled—both prompt and response are retained under the tenant’s existing retention policies.
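
The gate check, chunking, and retrieval steps above can be illustrated with a small sketch. This is not Copilot's implementation: the function names, the 4/3 tokens-per-word estimate, and the bag-of-words cosine score (standing in for real embeddings) are all assumptions made for illustration. It takes a list of per-page text strings as input, however they were extracted.

```python
import math
import re
from collections import Counter


def estimate_tokens(text):
    """Rough token estimate: ~4 tokens per 3 words (an assumption, not a tokenizer)."""
    return math.ceil(len(text.split()) * 4 / 3)


def has_text_layer(page_texts):
    """Gate check: at least one page must contain extractable text."""
    return any(t.strip() for t in page_texts)


def chunk_pages(page_texts, max_tokens=4000):
    """Greedily pack whole pages into chunks of at most ~max_tokens.

    A single page larger than the budget is kept as its own chunk.
    """
    chunks, current, used = [], [], 0
    for text in page_texts:
        cost = estimate_tokens(text)
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


def _vector(text):
    # Toy stand-in for an embedding: lower-cased term counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunks, query, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = _vector(query)
    ranked = sorted(chunks, key=lambda c: _cosine(_vector(c), q), reverse=True)
    return ranked[:top_k]


def build_prompt(chunks, question):
    """Assemble a grounded prompt from the retrieved chunks."""
    context = "\n---\n".join(chunks)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"
```

In a real deployment the similarity score would come from an embedding model and a vector index; the sketch only shows how the stages hand data to each other.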


2. Platform-by-Platform Capabilities

Each entry below lists: where you invoke Copilot · what you can do with a PDF · key limits · notes / extras.

OneDrive Web
  What you can do with a PDF: Generate single- or multi-file summaries, then drill down with follow-up questions.
  Key limits: Any supported file type ≤ 10 MB (trial 1 MB) per upload; daily quota 2 GB.
  Notes / extras: Works without opening the document; supports batch mode.

Microsoft Edge sidepane
  What you can do with a PDF: Read a PDF in the browser and ask Copilot to summarise, explain tables, translate passages, etc.
  Key limits: Practical best-practice: split files > 50 pages for faster answers.
  Notes / extras: Uses the same semantic index as M365, so browser context (URL, title) is automatically injected.

Copilot Chat (M365 app, web, mobile)
  What you can do with a PDF: Drag-and-drop a PDF into chat, then ask analytic or transformation prompts.
  Key limits: Hard ceiling ≈ 1.5 M words / 3 000 pages; ideal size < 7 500 words for chat; per-file size ≤ 24 MB.
  Notes / extras: Uploaded files are stored in user-scoped OneDrive for Business and are never used for training.

Copilot Studio (custom agents)
  What you can do with a PDF: Upload PDFs as knowledge sources; agents answer questions or power chatbots.
  Key limits: ≤ 512 MB per PDF, ≤ 500 files; SharePoint connectors up to 200 MB per file.*
  Notes / extras: Files live in Dataverse; vector search + grounding handled automatically.

Sustainability Manager
  What you can do with a PDF: ESG-specific “Document analysis” lets users upload up to 5 PDFs and interrogate them in natural language.
  Key limits: ≤ 3 MB per file (preview).
  Notes / extras: Answers are stored alongside ESG metrics for traceability.

*SharePoint knowledge sources inherit the 200 MB limit unless Enhanced Search is enabled (then 512 MB).
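
A pre-upload check against the per-file ceilings listed above might look like the following sketch. The surface keys, the function name, and the choice of 1 MB = 1 048 576 bytes are assumptions for illustration; the numbers are copied from this section and may change between releases. Edge is omitted because no MB ceiling is stated for it.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Per-file ceilings in MB, taken from the platform comparison in this section.
PER_FILE_LIMIT_MB = {
    "onedrive_web": 10,
    "copilot_chat": 24,
    "copilot_studio_upload": 512,
    "copilot_studio_sharepoint": 200,  # 512 with Enhanced Search enabled
    "sustainability_manager": 3,
}


def check_upload(surface, size_bytes):
    """Return (ok, message) for a PDF of size_bytes on the given surface."""
    if surface not in PER_FILE_LIMIT_MB:
        raise ValueError(f"unknown surface: {surface!r}")
    limit = PER_FILE_LIMIT_MB[surface]
    size_mb = size_bytes / MB
    if size_mb <= limit:
        return True, f"{size_mb:.1f} MB is within the {limit} MB limit"
    return False, f"{size_mb:.1f} MB exceeds the {limit} MB limit; split or compress the file"


if __name__ == "__main__":
    print(check_upload("copilot_chat", 30 * MB))
```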


3. Practical Limits & Performance Guidance

  • Context window realities – although Word can theoretically summarise 1.5 M words, response accuracy drops sharply beyond 15 k words; splitting large PDFs into thematic sections yields better answers.

  • File-size ceilings – the Edge and OneDrive viewers themselves do not impose a strict MB limit, but the upload endpoints do (10 MB per file for most licensed tenants).

  • Daily throughput – OneDrive AI requests share a 2 GB daily ingestion pool per user; Studio uploads are capped at 500 files per agent.

  • Scans vs. text-PDFs – Copilot ignores bitmap-only pages; run OCR first to avoid silent omissions.
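
Two of the points above, spotting bitmap-only pages and splitting long files, can be prepared for with a short sketch. It assumes the per-page text has already been extracted with a PDF library (for example, pypdf's `page.extract_text()`); the function names and the 20-character threshold are hypothetical. Fixed-size page ranges are used for simplicity, whereas splitting along thematic section boundaries, as recommended above, would need the document's outline.

```python
def find_textless_pages(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is empty or nearly so.

    Such pages are likely scans and should be OCR'd before upload.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def plan_splits(total_pages, max_pages=20):
    """Return inclusive 1-based (first, last) page ranges of at most max_pages each."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


if __name__ == "__main__":
    print(plan_splits(45))  # [(1, 20), (21, 40), (41, 45)]
```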


4. Security, Compliance, Residency

  • Enterprise data protection (EDP) encrypts prompts & completions at rest, aligns retention with Microsoft 365 policies, and blocks training use.

  • EU Data Boundary (EUDB) – for EEA tenants, inference traffic and storage stay inside EU regions; web-search calls can be disabled by policy.

  • Dataverse storage – files uploaded via Copilot Studio live in Dataverse file capacity (default 3 GB per environment) and are deletable by admins at any time.


5. Inside the Box — Key Components

  • Microsoft Graph supplies identity, sharing, and access-control context so that content the user is not permitted to see is filtered out before it reaches the LLM.

  • Azure AI Search (vector + lexical) handles chunk retrieval with hybrid ranking before the prompt is sent to the model.

  • Azure OpenAI GPT-4 Turbo generates prose; multi-modal extensions (“Vision”) are being rolled out to add in-PDF image understanding.
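
The combination of access filtering and hybrid ranking described above can be sketched as follows. Reciprocal Rank Fusion (RRF) is a common way to merge a lexical ranking with a vector ranking, and Azure AI Search documents it for hybrid queries; whether Copilot's pipeline applies it exactly this way is an assumption. The function names, the ACL dictionary, and the constant k = 60 are illustrative only.

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken by chunk id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


def security_trim(chunk_ids, acl, user):
    """Keep only chunks the user may read; chunks missing from the ACL are dropped."""
    return [c for c in chunk_ids if user in acl.get(c, set())]


if __name__ == "__main__":
    lexical = ["c1", "c2", "c3"]   # e.g. keyword ranking
    vector = ["c3", "c1", "c4"]    # e.g. embedding-similarity ranking
    acl = {"c1": {"alice"}, "c2": {"bob"}, "c3": {"alice", "bob"}}
    fused = rrf_fuse([lexical, vector])
    print(security_trim(fused, acl, "alice"))
```

Trimming by permission before the prompt is assembled means the model never sees text the user could not open directly.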


6. Practitioner Checklist

Make text selectable. OCR your scans.

Stay under the sweet spots. < 20 pages for iterative chat; split anything longer.

Use explicit, granular prompts. Ask for section-by-section analysis, tables only, etc.

Leverage follow-up. Copilot keeps PDF context alive for the entire chat thread.

Control exposure. Disable web search when handling confidential PDFs.

For custom agents, pre-tag your files. Consistent filenames and SharePoint metadata improve retrieval ranking.
