Can ChatGPT Transcribe Audio and How Does It Work?
- Graziano Stefanelli
- May 4
- 2 min read

ChatGPT & Audio Transcription — A Complete Overview
1. What “transcribe audio with ChatGPT” really means
Scenario | Built-in to ChatGPT? | How to do it today |
Live dictation (you speak, words appear) | Yes — with the Voice icon in the mobile apps, desktop app, and web | Tap or click Voice, grant mic access, and start talking. |
Continuous voice dialogue (“Jarvis mode”) | Yes — Advanced Voice (GPT-4o / 4o-mini) on Plus, Team, and Pro plans; daily preview for Free users | Same Voice icon → pick an output voice; ChatGPT listens and replies aloud in real time. |
Upload a prerecorded file (MP3/WAV) for a transcript | No — not in the chat UI | Use the Whisper API or a third-party tool that calls it. |
2. Option A — Speak directly in the ChatGPT interface
Platforms: iOS, Android, desktop app, and web.
How it works:
Press Voice and start speaking.
Standard Voice converts your speech to text (Whisper) before sending it to a GPT model.
Advanced Voice feeds raw audio directly into GPT-4o for faster, more natural replies.
Cost & limits: Standard Voice is free; Advanced Voice uses a daily or monthly audio-minute allowance that varies by plan.
Best for: Quick dictation, hands-free Q&A.
Not for: Long recordings—you still can’t attach files in chat.
3. Option B — Whisper API for file transcription

Key facts | Details |
Price | About $0.006 per audio minute, billed by the second. |
Max upload size | 25 MB per request—split larger files and stitch results later. |
Languages | 50-plus, auto-detected. |
Output formats | json, text, srt, and vtt. |
Typical uses | Meeting recordings, podcasts, caption generation, searchable archives. |
Tip: Pipe the returned text into a ChatGPT completion call if you need summaries, action items, or analysis.
4. Option C — GPT-4o Realtime Audio (developer preview)
What it is: A streaming API that lets you send and receive live audio with sub-second latency.
Indicative pricing: Roughly $0.06 per audio minute in, $0.24 per minute out (token-based, subject to change).
Best for: Voice-first apps, live translation, call-center agents, or anywhere you need conversational AI on the phone.
Status: In preview for selected developers, with broader availability expected later in 2025.
5. Common pitfalls & quick fixes
Pitfall | Fix |
Trying to drag-and-drop an audio file into chat | Use Whisper API or a wrapper; the chat UI still rejects audio uploads. |
Oversize file errors (HTTP 400) | Compress to ≤25 MB or chunk and reassemble transcripts. |
Unexpected Whisper costs | Whisper charges by duration, not file size—trim silence before uploading. |
Privacy concerns | Enterprise and regulated customers can enable zero-retention or host an open-source Whisper model on-prem. |
6. Quick decision tree
Want instant dictation or voice chat?→ Use the Voice icon inside ChatGPT.
Have a recorded file under 25 MB?→ Call the Whisper API.
Building a fully voice-driven product with live audio?→ Explore the GPT-4o Realtime Audio preview.
__________
ChatGPT itself can’t take audio file uploads yet, but between free in-app dictation, the affordable Whisper API, and the new GPT-4o realtime stack, there’s a solution for nearly every audio-to-text need in 2025.