Can ChatGPT Transcribe Audio and How Does It Work?

Graziano Stefanelli
May 4
2 min read

ChatGPT & Audio Transcription — A Complete Overview

1. What “transcribe audio with ChatGPT” really means

Scenario	Built-in to ChatGPT?	How to do it today
Live dictation (you speak, words appear)	Yes — with the Voice icon in the mobile apps, desktop app, and web	Tap or click Voice, grant mic access, and start talking.
Continuous voice dialogue (“Jarvis mode”)	Yes — Advanced Voice (GPT-4o / 4o-mini) on Plus, Team, and Pro plans; daily preview for Free users	Same Voice icon → pick an output voice; ChatGPT listens and replies aloud in real time.
Upload a prerecorded file (MP3/WAV) for a transcript	No — not in the chat UI	Use the Whisper API or a third-party tool that calls it.

2. Option A — Speak directly in the ChatGPT interface

Platforms: iOS, Android, desktop app, and web.
How it works:
1. Press Voice and start speaking.
2. Standard Voice converts your speech to text (Whisper) before sending it to a GPT model.
3. Advanced Voice feeds raw audio directly into GPT-4o for faster, more natural replies.
Cost & limits: Standard Voice is free; Advanced Voice uses a daily or monthly audio-minute allowance that varies by plan.
Best for: Quick dictation, hands-free Q&A.
Not for: Long recordings—you still can’t attach files in chat.

3. Option B — Whisper API for file transcription

Key facts	Details
Price	About $0.006 per audio minute, billed by the second.
Max upload size	25 MB per request—split larger files and stitch results later.
Languages	50-plus, auto-detected.
Output formats	json, text, srt, and vtt.
Typical uses	Meeting recordings, podcasts, caption generation, searchable archives.

Tip: Pipe the returned text into a ChatGPT completion call if you need summaries, action items, or analysis.

4. Option C — GPT-4o Realtime Audio (developer preview)

What it is: A streaming API that lets you send and receive live audio with sub-second latency.
Indicative pricing: Roughly $0.06 per audio minute in, $0.24 per minute out (token-based, subject to change).
Best for: Voice-first apps, live translation, call-center agents, or anywhere you need conversational AI on the phone.
Status: In preview for selected developers, with broader availability expected later in 2025.

5. Common pitfalls & quick fixes

Pitfall	Fix
Trying to drag-and-drop an audio file into chat	Use Whisper API or a wrapper; the chat UI still rejects audio uploads.
Oversize file errors (HTTP 400)	Compress to ≤25 MB or chunk and reassemble transcripts.
Unexpected Whisper costs	Whisper charges by duration, not file size—trim silence before uploading.
Privacy concerns	Enterprise and regulated customers can enable zero-retention or host an open-source Whisper model on-prem.

6. Quick decision tree

Want instant dictation or voice chat?→ Use the Voice icon inside ChatGPT.
Have a recorded file under 25 MB?→ Call the Whisper API.
Building a fully voice-driven product with live audio?→ Explore the GPT-4o Realtime Audio preview.

__________

ChatGPT itself can’t take audio file uploads yet, but between free in-app dictation, the affordable Whisper API, and the new GPT-4o realtime stack, there’s a solution for nearly every audio-to-text need in 2025.