OpenAI’s May 7 Voice Stack: How GPT‑Realtime‑2 Pricing Changes AI Phone Agent Economics
OpenAI’s May 7 launch put speech at pennies per minute, but the real cost driver for AI phone agents is reasoning tokens and AHT. Here’s how to cap reasoning, parallelize tools, and budget context so your cost per resolved call pencils—while lifting conversion and CX.
Hadi Sharifi
Founder & CEO
OpenAI just made voice I/O cheap enough to run at scale. But if you don’t control reasoning tokens and average handle time (AHT), your cost per resolved call still won’t pencil.
Key takeaways
- OpenAI’s May 7 pricing pushes speech to pennies per minute, but reasoning tokens and AHT dominate unit costs.
- Design agents to cap reasoning depth, parallelize tools, and budget context growth to avoid runaway spend.
- Conversion lifts are achievable: Zillow cites a 26‑point jump in call success (95% vs 69%), but only if flows are instrumented for speed and resolution.
- Realtime‑2’s controllability (preambles, tool transparency) and multilingual modes enable SLAs without bloating costs.
What actually changed on May 7
OpenAI launched GPT‑Realtime‑2 and new audio models priced at $32 per 1M audio input tokens and $64 per 1M audio output tokens, with transcription at $0.017/min and translation at $0.034/min. Source: https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/
The headline: speech I/O is now priced like a utility. Early users are reporting real business impact—Zillow cites a 26‑point lift in call success (95% vs 69%) on real customer interactions—suggesting that well‑designed agents can both reduce AHT and raise resolution quality.
The fine print: the audio numbers are not the full story. The dominant cost driver becomes text‑reasoning tokens (the model’s thinking, planning, and tool‑calling). If you don’t engineer the flow, the model will burn tokens while the customer waits—stretching AHT and spiking costs.
The unit economics that matter
Price per minute looks friendly, but CFOs care about cost per resolved call and downstream revenue impact. A simple framing:
- Cost per resolved call ≈ (audio I/O cost + reasoning‑token cost + tool/API cost + escalation cost) / resolved calls
- Resolution rate and first‑call resolution (FCR) push revenue up; AHT and reasoning tokens push cost up.
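The formula above can be sketched as a quick calculator. The audio rates are the article's published prices; every other input (call shape, reasoning token rate, tool and escalation costs) is a hypothetical placeholder you would replace with your own telemetry:

```python
# Illustrative cost-per-resolved-call calculator.
AUDIO_IN_PER_M = 32.00   # $ per 1M audio input tokens (published rate)
AUDIO_OUT_PER_M = 64.00  # $ per 1M audio output tokens (published rate)

def cost_per_resolved_call(
    calls: int,
    resolution_rate: float,   # fraction of calls resolved without escalation
    audio_in_tokens: int,     # avg audio input tokens per call
    audio_out_tokens: int,    # avg audio output tokens per call
    reasoning_tokens: int,    # avg text reasoning tokens per call
    reasoning_per_m: float,   # $ per 1M reasoning tokens (model dependent)
    tool_cost: float,         # avg tool/API spend per call
    escalation_cost: float,   # avg human-handling cost per escalated call
) -> float:
    audio = (audio_in_tokens * AUDIO_IN_PER_M
             + audio_out_tokens * AUDIO_OUT_PER_M) / 1_000_000
    reasoning = reasoning_tokens * reasoning_per_m / 1_000_000
    per_call = audio + reasoning + tool_cost
    escalations = calls * (1 - resolution_rate) * escalation_cost
    total = calls * per_call + escalations
    return total / (calls * resolution_rate)
```

Run it against p50 and p90 call shapes: the reasoning term usually dwarfs the audio term well before the escalation term does.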
Audio is now a rounding error for many use cases. At $0.017/min for transcription and $0.034/min for translation, six minutes of mono speech costs about a dime, not dollars. Even audio token pricing at $32/$64 per 1M is structurally low compared with complex reasoning spend. The trap is invisible: repeated chain‑of‑thought behavior, verbose confirmations, and serial tool calls can multiply the token bill and add 45–120 seconds per call.
Non‑obvious insight: the fastest agent is not always the cheapest—unless it thinks efficiently. Speed via unbounded reasoning can quietly exceed the audio savings you just captured. The discipline is to design for minimum necessary reasoning per turn, not maximum perceived intelligence.
Make AI phone agents pencil: 6 design levers
- Cap reasoning depth by intent and risk tier
- Use Realtime‑2 preambles to enforce succinct reasoning: “Decide in ≤2 steps; ask ≤1 clarifying question; prefer action.”
- Set explicit budgets per turn: maximum tokens, maximum tool calls, and a hard timeout (e.g., 450–800 ms) before escalating to a cached fallback or a narrower tool.
- For high‑risk intents (payments, cancellations), allow deeper reasoning but gate with a cost/latency budget and a supervision check.
- Parallelize and prefetch tools
- Replace serial tool calls with parallel execution: identity check + inventory lookup + policy validation in one beat. Merge results at the agent layer.
- Prefetch obvious data on call start (customer profile, order status) to save 1–2 round‑trips per conversation—often 8–15 seconds of AHT.
- Budget context growth
- Summarize every 2–3 turns into a compact state object (facts, intent, next action). Drop raw transcript after summarization.
- Keep tool results, not verbose explanations, in memory. The agent can regenerate language from structured state if needed.
- Cap total context tokens and rotate older state into a “case memo” the agent can query on demand.
- Enforce tool transparency and observability
- Use Realtime‑2’s tool transparency model: emit tool_start/tool_end events, with per‑tool token/time budgets and failure handlers.
- Log token spend by intent, by tool, by turn. Alert when a flow exceeds its budget or when retries >1.
- Optimize turn‑taking and barge‑in
- Encourage shorter user/agent turns. Configure a 250–300 ms barge‑in to cut talk‑over and wasted speech tokens.
- Ban pleasantry loops. Close with a single confirmation and an action receipt, not a recap monologue.
- Multilingual pricing discipline
- Auto‑detect language at call start. Use $0.034/min translation only when source and target languages differ; default to $0.017/min transcription otherwise.
- For bilingual callers, hold a single language per call to avoid double translation passes.
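Levers 1 and 2 can be sketched together: run the identity, inventory, and policy tools in parallel under a hard per‑turn latency budget, and fall back to a cached answer when the budget is blown. The tool functions, their latencies, and the 700 ms budget are hypothetical stand‑ins for your own integrations:

```python
import asyncio

TURN_BUDGET_S = 0.7  # hard per-turn latency budget before fallback

async def identity_check(caller: str) -> dict:
    await asyncio.sleep(0.05)   # stand-in for a CRM call
    return {"caller": caller, "verified": True}

async def inventory_lookup(sku: str) -> dict:
    await asyncio.sleep(0.10)   # stand-in for an inventory API
    return {"sku": sku, "in_stock": 7}

async def policy_check(intent: str) -> dict:
    await asyncio.sleep(0.05)   # stand-in for a policy service
    return {"intent": intent, "allowed": True}

async def run_turn(caller: str, sku: str, intent: str) -> dict:
    # Fire all three tools in one beat instead of serially.
    tasks = [identity_check(caller), inventory_lookup(sku), policy_check(intent)]
    try:
        ident, inv, pol = await asyncio.wait_for(
            asyncio.gather(*tasks), timeout=TURN_BUDGET_S
        )
        return {"state": {**ident, **inv, **pol}, "fallback": False}
    except asyncio.TimeoutError:
        # Budget blown: answer from cache / a narrower tool, then escalate.
        return {"state": {"cached": True}, "fallback": True}

result = asyncio.run(run_turn("+15550100", "BRK-2231", "availability"))
```

Turn latency is now the slowest tool rather than the sum of all three, which is where the 8–15 seconds of AHT savings come from.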
Blueprint example: inbound auto‑parts calls that convert
Consider an auto‑parts seller handling 2,000 inbound calls/day for fitment, price, and availability. The goals: keep cost per resolved call under $0.90, cut AHT from 5:40 to ≤3:30, and lift conversion.
Flow design (leveraging GPT‑Realtime‑2):
- Preamble constrains behavior: “Greet in 3–5 words. Ask 1 clarifying question. Prefer action. No recaps >8 words.”
- Turn‑1 data prefetch: customer lookup (CRM), recent carts, top SKUs for make/model/year. Parallel calls; 700 ms budget.
- Fitment/price tools: call a parts catalog and pricing engine in parallel. Prefer structured outputs; keep raw reasoning off the transcript.
- Multilingual mode: detect language within the first 2 seconds. Use translation only if the agent and caller speak different languages.
- Resolution close: confirm, collect payment or send cart link, and offer an install guide by SMS. Single‑sentence farewell.
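The "keep raw reasoning off the transcript" step above is lever 3 in miniature: every few turns, fold the transcript into a compact state object and drop the raw turns so context tokens stay bounded. The field names are hypothetical, and in production the `summarize` step would be a cheap model call rather than a plain clear:

```python
from dataclasses import dataclass, field

SUMMARIZE_EVERY = 3  # fold transcript into state every N turns

@dataclass
class CallState:
    facts: dict = field(default_factory=dict)       # durable facts (VIN, SKU, fitment)
    intent: str = ""
    next_action: str = ""
    transcript: list = field(default_factory=list)  # raw turns awaiting summary

    def add_turn(self, speaker, text, facts=None):
        self.transcript.append(f"{speaker}: {text}")
        if facts:
            self.facts.update(facts)   # tool results go straight to structured state
        if len(self.transcript) >= SUMMARIZE_EVERY:
            self.summarize()

    def summarize(self):
        # Stand-in for a cheap summarization call: keep structure, drop prose.
        self.transcript.clear()        # raw turns no longer occupy context

state = CallState(intent="fitment", next_action="quote_price")
state.add_turn("caller", "Need brake pads for a 2019 Camry",
               facts={"make": "Toyota", "model": "Camry", "year": 2019})
state.add_turn("agent", "Front or rear?")
state.add_turn("caller", "Front", facts={"position": "front"})
```

After the third turn the transcript is gone but the facts survive, so the agent can regenerate language from structured state instead of re-reading prose.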
Observed impact (pattern from similar deployments):
- AHT cut by 30–40% (prefetch + parallel tools + barge‑in) without harming CSAT.
- Audio cost: pennies per call; controlled reasoning keeps the total under target.
- Revenue lift: higher quote‑to‑order conversion from faster, accurate fitment confirmation.
Tie‑in to operations: a salvage yard using a similar pattern can surface real‑time VIN/part matches, push dynamic prices, and route only edge cases to human specialists. That keeps staffing focused on judgment‑heavy work (sourcing, high‑value negotiations) instead of routine availability checks.
Instrumentation and SLAs you should enforce
- Golden metrics: Cost per resolved call; AHT p50/p90; first‑call resolution; abandonment rate; tool latency p95; token spend per resolution; resolution‑to‑revenue conversion.
- Budgets by intent: example—inventory inquiry (≤1 tool round, ≤2 turns), account update (≤2 tools, ≤3 turns), payment (≤3 tools, ≤4 turns, mandatory confirmation).
- Guardrails: escalate when tool_chain_retry > 1, when reasoning tokens per turn exceed budget, or when the timer hits 1.5× p95 for that intent.
- Quality loops: review every call with a budget violation or negative outcome, plus a random 2–5% sample for calibration. Don’t waste humans on green calls.
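The budgets and guardrails above reduce to a table plus one check function. The intents and tool/turn limits mirror the examples in this section; the reasoning‑token and latency thresholds are hypothetical tuning values:

```python
# Per-intent budgets (tool/turn limits from this section; token caps illustrative).
BUDGETS = {
    "inventory_inquiry": {"max_tools": 1, "max_turns": 2, "max_reasoning_tokens": 800},
    "account_update":    {"max_tools": 2, "max_turns": 3, "max_reasoning_tokens": 1500},
    "payment":           {"max_tools": 3, "max_turns": 4, "max_reasoning_tokens": 2500},
}

def should_escalate(intent, tools_used, turns, reasoning_tokens,
                    tool_chain_retries, elapsed_s, p95_s):
    """Return (escalate, reason) per the guardrails above."""
    b = BUDGETS.get(intent)
    if b is None:
        return True, "unknown_intent"
    if tool_chain_retries > 1:
        return True, "tool_chain_retry"
    if reasoning_tokens > b["max_reasoning_tokens"]:
        return True, "reasoning_budget"
    if tools_used > b["max_tools"] or turns > b["max_turns"]:
        return True, "flow_budget"
    if elapsed_s > 1.5 * p95_s:   # timer hits 1.5x p95 for this intent
        return True, "latency"
    return False, ""
```

Logging the returned reason per call gives you the budget-violation sample your quality loop reviews.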
Conclusion
Voice is affordable now. The winners will treat reasoning and AHT like line items—not mysteries. Blueprint your flows with hard budgets, parallel tools, preambles that constrain verbosity, and multilingual modes used only when necessary. That’s how you get higher resolution rates and predictable unit costs.
If you want a blueprint that locks SLAs and keeps per‑resolution economics under control, stand up an AI voice agent with a measured rollout and the right budgets. Start a scoped engagement here: Work with Niotex.
