By People's Voice Editorial · Deep Dive · May 8, 2026, 2:02 PM

OpenAI Launches Realtime Voice Models For Agents And Translation

Photo by Jernej Furman, via Wikimedia Commons (CC BY 2.0)

The release moves OpenAI's voice API deeper into live agents, multilingual support and streaming transcription, but the benchmark claims still need production proof.

SAN FRANCISCO, Calif. - OpenAI said Thursday it launched three realtime audio models for developers building voice agents, live speech translation and streaming transcription, expanding the company's push to make voice a production interface for software rather than a demo layer.

The release matters because OpenAI is trying to move voice AI from short call-and-response exchanges into systems that can keep a session open, reason through a request, call tools, translate, transcribe and stream spoken replies while a conversation is still moving. For businesses, the practical question is whether those systems can become reliable and cheap enough for customer support, travel, healthcare administration, education and accessibility.

The Story So Far

OpenAI said the three models are GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper. The company described GPT-Realtime-2 as its first voice model with GPT-5-class reasoning, GPT-Realtime-Translate as a live interpreter model that translates speech from more than 70 input languages into 13 output languages, and GPT-Realtime-Whisper as a streaming speech-to-text model.

"We’re introducing three audio models in the API that unlock a new class of voice apps for developers." - OpenAI, company announcement

OpenAI's mark supports a release focused on developer platform capability rather than a consumer hardware product. Photo by OpenAI, via Wikimedia Commons (public domain).

OpenAI's developer documentation says realtime sessions keep a connection open while an application sends audio, receives events and updates session state. The documentation separates three main paths: voice-agent sessions for assistants that respond and call tools, translation sessions for continuous interpreter behavior, and transcription sessions for streaming transcript deltas without spoken model replies.

That architecture is the technical center of the release. A request-based speech system can handle a file or a short input. A realtime session is meant to manage live audio, interruptions, partial transcripts, tool calls and state updates while the user is still talking.
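A minimal sketch of that session shape, written against the WebSocket transport and event names OpenAI's existing Realtime API uses, looks like the code below. The gpt-realtime-2 slug comes from the announcement; the exact endpoint and event names for the new models are assumptions, not confirmed documentation.

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets (>= 14 for additional_headers)

# Model slug from the announcement; endpoint shape and event names follow
# OpenAI's existing Realtime API and are assumptions for the new models.
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

async def run_session(audio_chunks):
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # One session, many turns: configure once, then stream audio and events.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": "You are a concise voice agent."},
        }))
        for chunk in audio_chunks:  # raw PCM bytes from a mic or media pipeline
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:  # events keep arriving while the session is open
            event = json.loads(message)
            print(event["type"])  # e.g. transcript deltas, audio deltas, tool calls
            if event["type"] == "response.done":
                break
```

The point of the shape is the open connection: the application keeps sending audio and handling events on the same session rather than issuing one request per utterance.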

What's Happening Now

OpenAI says GPT-Realtime-2 can keep a live voice conversation moving while it reasons through a request, calls tools, handles corrections and changes tone. The company said developers can enable short preambles, such as telling a user the system is checking a calendar, so the agent does not appear to stall while it uses external tools.

The company also said GPT-Realtime-2 increases the context window from 32K to 128K tokens for longer sessions and more complex workflows. OpenAI's documentation says developers can choose reasoning effort levels of minimal, low, medium, high and xhigh, with low as the default for most production voice agents because latency still matters in spoken interaction.
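A session configuration reflecting those controls might look like the snippet below. The effort levels and the preamble behavior come from OpenAI's descriptions; the exact field names are assumptions rather than confirmed schema.

```python
# Hypothetical session configuration reflecting the controls described above.
# The reasoning-effort levels and the preamble instruction come from the
# article; field names are assumptions, not confirmed API schema.
session_config = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",       # name from the announcement
        "reasoning": {"effort": "low"},  # docs' default; latency still matters
        "instructions": (
            "Before calling a tool, say a one-line preamble such as "
            "'Let me check the calendar' so the user knows you are working."
        ),
    },
}
```

The payload would be sent over the open connection with ws.send(json.dumps(session_config)), the same session.update event used in the sketch above.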

Transport choices show the intended deployment paths. OpenAI's docs say WebRTC is designed for browser and mobile clients that capture and play audio directly, WebSocket is intended for server media pipelines such as call systems or workers, and SIP is available for telephony voice agents subject to model support.

GPT-Realtime-Translate uses a dedicated translation endpoint rather than the standard voice-agent lifecycle, according to OpenAI's docs. Translation sessions stream audio in and stream translated audio and transcript deltas out, without waiting for the client to commit a normal user turn.
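A rough sketch of that interpreter loop follows; the endpoint path and event names are assumed rather than documented here, and only the continuous stream-in, stream-out behavior comes from OpenAI's description.

```python
import asyncio
import base64
import json
import os

import websockets

# Assumed endpoint for the dedicated translation path.
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate"

async def interpret(mic_chunks, play_audio):
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:

        async def pump_in():
            # Continuous capture: no turn commits, no response.create calls.
            async for chunk in mic_chunks:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode("ascii"),
                }))

        pump = asyncio.create_task(pump_in())
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.output_audio.delta":  # assumed event name
                play_audio(base64.b64decode(event["delta"]))
        pump.cancel()
```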

GPT-Realtime-Whisper is narrower but important for workflow products. OpenAI says the model transcribes live speech as people talk, which can support captions, meeting notes, broadcast workflows, classroom tools and call-center systems that need partial text before a full recording is complete.
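A transcription consumer in that mold could look like this. The event names mirror OpenAI's existing transcription sessions; the payload fields are assumptions for the new model.

```python
import json

# Sketch of a transcription session consumer: transcript deltas stream back
# while audio is still being sent, and no spoken reply is generated.
async def consume_transcripts(ws, on_partial, on_final):
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "conversation.item.input_audio_transcription.delta":
            on_partial(event["delta"])     # early partial text for live captions
        elif event["type"] == "conversation.item.input_audio_transcription.completed":
            on_final(event["transcript"])  # settled text for notes or records
```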

How The Mechanism Works

The release is not simply a better synthetic voice. It combines native audio models, persistent realtime sessions and developer controls over how an agent behaves while it listens and responds.

A speech waveform illustrates the core problem in realtime voice systems: turning live audio into usable text, actions and replies without breaking the conversation. Photo by Wikimedia Foundation Staff and Contractors, via Wikimedia Commons (CC BY-SA 4.0).

In a voice-agent session, the application connects to OpenAI's realtime endpoint, streams user audio and listens for model events, tool calls and spoken responses. If the agent needs outside data, such as a calendar, order system or travel reservation, the model can call tools while the session remains active.
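In code, that round trip follows the function-call events in OpenAI's existing Realtime API; the tools mapping below is a hypothetical stand-in for whatever calendar, order or reservation lookup the developer supplies.

```python
import json

async def handle_event(ws, event, tools):
    # "tools" maps tool names to the developer's own callables
    # (e.g. a calendar lookup); the event flow follows OpenAI's
    # existing Realtime API function-call events.
    if event["type"] == "response.function_call_arguments.done":
        result = tools[event["name"]](**json.loads(event["arguments"]))
        # Feed the tool output back into the still-open session...
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))
        # ...then ask the model to keep speaking with the new data in hand.
        await ws.send(json.dumps({"type": "response.create"}))
```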

In a translation session, the application does not ask the model to create a normal assistant response. OpenAI's docs say the app continuously streams speech into a dedicated translation session and receives translated audio and transcript deltas back. That design is meant for interpreter behavior, not a chatbot turn.

In a transcription session, the application wants text from audio without a model-generated voice. OpenAI's docs say lower delay settings can produce earlier partial text, while higher delay settings can improve transcript quality. That tradeoff means a live caption product, a medical documentation product and a broadcast monitoring product may choose different defaults.
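Expressed as configuration, those diverging defaults might look like this; the delay field and its values are hypothetical, chosen only to illustrate the tradeoff OpenAI's docs describe.

```python
# Illustrative defaults for the delay-versus-quality tradeoff. The idea of a
# tunable delay comes from OpenAI's docs; these names and values are
# hypothetical, chosen to show how products might diverge.
TRANSCRIPTION_PROFILES = {
    "live_captions":   {"delay": "low"},     # earliest partial text, rougher output
    "medical_notes":   {"delay": "high"},    # later text, higher transcript quality
    "broadcast_watch": {"delay": "medium"},  # balance for monitoring workflows
}
```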

Evaluation Caveats

OpenAI said GPT-Realtime-2 at high reasoning scores 15.2 percent higher than GPT-Realtime-1.5 on Big Bench Audio for audio intelligence. The company also said GPT-Realtime-2 at xhigh reasoning scores 13.8 percent higher on Audio MultiChallenge for instruction following.

Those claims should be read as benchmark claims, not proof that the model will handle every noisy call center, classroom, clinic or airport support desk. Artificial Analysis says Big Bench Audio contains 1,000 audio files drawn from four Big Bench Hard categories: formal fallacies, navigation, object counting and web of lies. The benchmark tests whether native audio models can reason from spoken questions, but it does not cover every production environment.

"Our current Speech to Speech benchmarking evaluates native audio models - models that support native audio input and output - across three quality dimensions: speech reasoning, conversational dynamics and agentic performance." - Artificial Analysis methodology page

OpenAI also cited customer testing from Zillow. In OpenAI's announcement, Josh Weisberg, Zillow's senior vice president and head of AI, said GPT-Realtime-2 produced a 26-point lift in call success rate after prompt optimization, from 69 percent to 95 percent, on Zillow's hardest adversarial benchmark. That is a relevant business signal, but it is still a company-provided customer claim rather than an independently audited study.

"On our hardest adversarial benchmark, this translates to a 26-point lift in call success rate after prompt optimization (95% vs. 69%)." - Josh Weisberg, senior vice president and head of AI at Zillow, in OpenAI's announcement

BolnaAI gave OpenAI another customer claim on translation. In the announcement, Prateek Sachan, BolnaAI's co-founder and chief technology officer, said GPT-Realtime-Translate delivered 12.5 percent lower word error rates than any other model the company tested across Hindi, Tamil and Telugu. That claim points to the market need, especially in multilingual support, but it also depends on the test set, audio conditions and deployment design.

Safety And Policy

Voice agents carry a different risk profile from text chat because they can act while a person is still speaking. If a system can schedule appointments, change travel plans, handle housing requests or support customers in multiple languages, errors can produce immediate operational consequences.

OpenAI says the Realtime API uses active classifiers over sessions and that some conversations can be halted if the system detects harmful-content violations. The company says developers can add their own safety guardrails through the Agents SDK, and its usage policies prohibit spam, deception, malicious cyber activity, weapons development and certain high-stakes decisions without human review.

OpenAI's docs also recommend a safety identifier for realtime requests when an application identifies end users. The company says a stable, privacy-preserving identifier can help target abuse enforcement to an individual user rather than an entire organization.
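One plausible implementation is to hash an internal user ID rather than send raw identifiers. The safety_identifier field name below is an assumption for realtime sessions, modeled on OpenAI's other APIs.

```python
import hashlib

def safety_identifier(internal_user_id: str) -> str:
    # Stable and privacy-preserving: the same user always maps to the same
    # opaque value, but no raw PII leaves the application.
    return hashlib.sha256(internal_user_id.encode("utf-8")).hexdigest()

session_config = {
    "type": "session.update",
    "session": {"safety_identifier": safety_identifier("user-12345")},  # assumed field
}
```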

NIST's AI Risk Management Framework gives the broader enterprise frame. NIST says its framework is intended to help organizations manage risks to individuals, organizations and society and to incorporate trustworthiness considerations into AI design, development, use and evaluation. For voice agents, that means testing real audio conditions, accents, privacy controls, escalation paths and human review before letting the system take consequential actions.

Economic Implications

OpenAI's pricing shows why adoption will depend on workflow value, not novelty. The company says GPT-Realtime-2 costs $32 per 1 million audio input tokens, $0.40 per 1 million cached input tokens and $64 per 1 million audio output tokens. It prices GPT-Realtime-Translate at $0.034 per minute and GPT-Realtime-Whisper at $0.017 per minute.
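A back-of-the-envelope calculation shows how those rates translate into per-call costs. The tokens-per-minute figure below is a placeholder assumption, since OpenAI has not published a conversion rate here; only the dollar rates come from the announcement.

```python
# Rates from OpenAI's listed pricing for GPT-Realtime-2.
AUDIO_IN_PER_M = 32.00   # $ per 1M audio input tokens
AUDIO_OUT_PER_M = 64.00  # $ per 1M audio output tokens

def agent_call_cost(in_tokens: int, out_tokens: int) -> float:
    return in_tokens / 1e6 * AUDIO_IN_PER_M + out_tokens / 1e6 * AUDIO_OUT_PER_M

# A 5-minute support call at ~600 tokens per minute each way (assumed rate):
print(f"${agent_call_cost(3000, 3000):.3f}")  # ~$0.29 before cached-input savings
```

By comparison, five minutes of GPT-Realtime-Translate at the listed $0.034 per minute would run $0.17, which is the kind of unit arithmetic buyers will run against call-center labor and vendor contracts.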

For a business, those rates have to be compared with labor costs, existing call-center software, translation vendors, captioning tools and the cost of mistakes. A voice agent that resolves routine support calls could justify usage costs if it reduces handle time or expands language coverage. A model that mishears names, mishandles housing rules or fails during noisy calls can erase those savings quickly.

The U.S. platform angle is also significant. OpenAI is competing to make American AI infrastructure the default layer for realtime customer interfaces, travel tools, education products and enterprise workflows. If developers build voice products around OpenAI's transports, pricing model and safety controls, the company gains both usage volume and platform lock-in.

By The Numbers

OpenAI announced three new audio models: GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper.

GPT-Realtime-Translate supports more than 70 input languages and 13 output languages, according to OpenAI.

GPT-Realtime-2 increases context from 32K to 128K tokens, according to OpenAI.

OpenAI said GPT-Realtime-2 at high reasoning scored 15.2 percent higher than GPT-Realtime-1.5 on Big Bench Audio.

OpenAI priced GPT-Realtime-Translate at $0.034 per minute and GPT-Realtime-Whisper at $0.017 per minute.

What People Are Saying

"GPT-Realtime-2 is built for live voice interactions where the model keeps the conversation moving while it reasons through a request, calls tools, handles corrections or interruptions, and responds in a way that fits the moment." - OpenAI, company announcement

"Realtime sessions keep a connection open while your application sends audio, receives events, and updates session state." - OpenAI developer documentation

"Building voice AI for India means handling diverse regional phonetics." - Prateek Sachan, co-founder and chief technology officer at BolnaAI, in OpenAI's announcement

"The Framework was developed through a consensus-driven, open, transparent, and collaborative process." - National Institute of Standards and Technology, AI Risk Management Framework overview

The Big Picture

OpenAI's release puts more weight on voice as a software interface for work, not just a way to talk to a consumer assistant. The technical bet is that persistent sessions, tool calls, realtime translation and streaming transcription can make spoken interaction useful in workflows where typing is slow or unavailable.

The next tests are practical. Developers will have to measure latency, accuracy, cost, safety handling, language coverage and user trust in real deployments. Benchmark gains and customer pilots can help frame the opportunity, but production audio is where the claims will either hold or break.