
Voice Agent Internals
A voice agent that handles inbound phone calls is doing something a chatbot never has to. It is racing a clock measured in milliseconds, managing a conversation that might run twenty minutes, executing API calls in real time, and maintaining coherent context across dozens of turns — all while sounding like it is paying attention. The voice is what the caller hears. The pipeline is what makes that possible, or causes it to fail.
Modern voice agents are not applications in the traditional sense. They are pipelines — sequences of specialized subsystems, each consuming the output of the previous one, each adding its own latency to a budget that has to stay under roughly two seconds or the conversation starts to feel broken. Understanding how they work requires understanding each component in that chain individually, how they compose, and what happens when one of them degrades. This article walks through the architecture of a production voice agent end to end. The primary lens is inbound telephone agents — where a caller dials a number and a voice agent answers — because those constraints are hardest. Telephony imposes codec limitations, echo characteristics, and infrastructure requirements that web-based voice agents partially sidestep. Web agents and outbound calling agents share the same pipeline with fewer constraints; where they differ meaningfully, those differences are noted.
Voice Agent Architecture
When stripped to essentials, a voice agent is a pipeline with a brain in the middle. On the input side: audio comes in from the caller, passes through signal processing, and converts to text. In the middle: a language model reads that text in context and generates a response. On the output side: that response converts to audio and plays back to the caller. Surrounding this pipeline are guardrails that constrain what the model can say, tools it can call mid-conversation, and an evaluation layer that tracks whether any of it is working.
The pipeline has eight primary stages for an inbound call: audio transport, voice activity detection, acoustic processing, speech-to-text, the language model, text-to-speech, and audio delivery back to the caller. Two additional systems cut across all stages: guardrails applied at the model input and output boundaries, and continuous evaluation applied in production. For complex tasks, a multi-agent orchestration layer sits above the language model stage. For telephone deployments, all of this sits behind a SIP or telephony integration that most voice agent developers never see directly.
Every design decision in a voice agent bends around latency. The round-trip from the moment a caller stops speaking to the moment they hear audio in response — end-to-end latency — has to stay under approximately 1.5 to 2 seconds for conversation to feel natural. Each stage in the pipeline takes a share of that budget. A STT provider that is 300ms slower than an alternative is not a minor inconvenience; it is 300ms subtracted from every other component in the chain. Understanding how each component works requires understanding what it costs in time — and what it costs in dollars.
Audio Transport Determines What the Pipeline Works With
How audio reaches the pipeline depends on where the caller is calling from. For telephone calls, the path runs through PSTN — the traditional phone network — via SIP trunking. SIP (Session Initiation Protocol) is the signaling layer that sets up and tears down calls; it handles the negotiation of call parameters and the routing of the call from the caller’s carrier to the agent’s infrastructure. The actual audio travels separately over RTP (Real-time Transport Protocol), a lightweight UDP-based protocol designed for time-sensitive media where occasional packet loss is preferable to the latency that TCP’s retransmission behavior would introduce.
For telephone calls, the audio codec is almost always PCMU (G.711 μ-law), the standard telephony codec in use since the 1980s. It encodes audio at 64 kbps in 8 kHz mono. Its advantage is universality — every carrier, every PBX, every telephone system understands it. Its limitation is fidelity: G.711 captures human voice adequately but discards frequency content above 4 kHz. That upper range is where consonants like “s,” “f,” and “th” live. Losing it does not make speech unintelligible but it does make speech recognition measurably harder, which is a constraint the rest of the pipeline inherits.
Web-based voice agents use WebRTC instead of SIP. WebRTC is the browser standard for real-time audio and video, and it defaults to Opus — a modern codec that operates from 6 kbps to 510 kbps, adapts to network conditions in real time, and captures frequencies up to 20 kHz. A web agent built on WebRTC starts with significantly better audio quality than a telephony agent built on G.711, which translates directly to lower word error rates in speech recognition downstream. Web agents typically run 3–5 WER points better than telephony agents on identical content for this reason alone.
Between the application and raw telephony infrastructure sits a layer of platforms that handle the integration complexity. Twilio Media Streams exposes phone call audio as a WebSocket stream — the application connects, receives audio frames from the caller, and sends audio back without managing SIP directly. LiveKit provides WebRTC infrastructure optimized for low-latency real-time communication, with built-in participant management, recording, and echo cancellation. Daily.co offers a similar WebRTC layer with more opinionated defaults for voice agent use cases. These platforms abstract the SIP plumbing and codec negotiation, but they do not abstract the latency those layers introduce; each platform adds roughly 50–200ms to the pipeline depending on geographic proximity to the agent’s servers.
Two additional signal concerns appear at this layer. Jitter buffers compensate for network variance: RTP packets arrive at irregular intervals due to network congestion, and the jitter buffer holds incoming packets and releases them at a consistent rate, trading a small amount of additional latency (typically 20–50ms) for smoother audio. Packet loss concealment handles dropped packets by using interpolation to reconstruct missing audio frames. Modern PLC algorithms recover gracefully from 5–10% packet loss rates. Above that, audio quality degrades audibly and speech recognition accuracy falls sharply.
Choosing the Infrastructure Layer
Before the first line of agent logic is written, there is a foundational choice: how much of the stack to own. Voice agent infrastructure spans telephony, audio transport, STT, LLM, and TTS — each individually composable, but requiring significant integration work if assembled from scratch. A growing set of platforms abstract some or all of this, and the right choice depends on where you are in the deployment journey and what you need to control.
Vapi is the most abstracted option. Connect a phone number and an LLM provider and Vapi handles the rest: audio transport, STT, turn management, TTS, and telephony routing. Configuration rather than code. For prototyping a new agent or validating whether a use case works over voice at all, it is the fastest path from idea to a working call. The tradeoffs appear at scale: Vapi’s pipeline decisions are its own, and when latency or cost requirements become specific enough that those decisions matter, the abstraction starts to resist rather than help. Per-minute pricing on a managed platform also carries a margin; at high call volumes, that margin compounds.
LiveKit is the right middle layer for developers who want full pipeline control without rebuilding WebRTC from scratch. LiveKit handles real-time audio transport and session infrastructure; you choose your STT provider, LLM, and TTS, and write the logic that connects them. It is well-suited to web-based voice agents where WebRTC is the natural transport. For telephony, LiveKit requires a SIP bridge, which adds integration complexity. It is open-source, deployable on your own infrastructure, and has a strong ecosystem of voice agent tooling built on top of it.
Twilio Media Streams is the choice when you need carrier-grade telephony reliability and full control over the processing pipeline. Twilio routes and manages the call; Media Streams exposes the audio as a WebSocket stream that your application processes however it needs to. Twilio provides no help with STT, LLM, or TTS — it is purely the transport and telephony layer. Pairing Twilio with a custom processing stack gives maximum component-level control at the cost of owning the integration work.
Daily.co occupies a similar position to LiveKit for web-based agents, with more opinionated defaults for voice agent use cases and a growing set of voice-specific plugins. It is often the choice for developers who want more structure than raw LiveKit without the full abstraction of a managed platform.
The practical sequence for most builders: start with Vapi or an equivalent managed platform to validate the use case without spending time on infrastructure. Move to Twilio or LiveKit when pipeline control, latency requirements, or per-minute economics become constraints the managed platform cannot accommodate. Consider building directly against raw SIP only when there is a specific, quantified reason that raw access provides something the alternatives cannot. Most teams that skip the first step and build from scratch spend months on infrastructure before writing the first line of agent logic.
Voice Activity Detection Is the Pipeline’s Clock
Voice activity detection solves a deceptively simple problem: knowing when the caller is speaking. A voice agent cannot start transcribing silence, and it cannot generate a response before the caller is done. VAD is what decides when each of those things is true, and it makes those decisions 50 times per second.
The fundamental unit of audio processing is the frame — typically 20 milliseconds of audio. Each frame is fed to the VAD in real time. The VAD classifies each frame: speech or not speech. Frames classified as speech are collected and queued for transcription; frames classified as silence trigger endpointing logic that decides when the turn is over.
Early VAD implementations used energy-based detection: if the audio power exceeds a threshold, classify as speech. This works in quiet environments and fails in everything else. Background noise, HVAC, music on hold, a caller who speaks quietly, a caller using speakerphone — all produce audio energy that defeats a threshold-based detector. Modern deployments use ML-based VAD almost exclusively. Silero VAD, an open-source model, is the most widely deployed option in production voice agent stacks. It uses a lightweight convolutional architecture at roughly 50KB — small enough to run on commodity servers with negligible GPU requirements — and achieves above 99% accuracy in clean speech conditions. In noisy environments, accuracy typically stays above 85%. WebRTC VAD, embedded in Google’s WebRTC libraries, uses Gaussian mixture models and is simpler but less accurate than Silero; it ships as part of most WebRTC stacks, which makes it a common default for web agent deployments.
The harder problem is endpointing — determining when the caller is actually done speaking rather than just pausing. A caller who says “I need to reschedule my appointment for...” and pauses to think is not done. The VAD has detected speech, then silence. The endpointing logic has to decide whether to trigger transcription or keep waiting. The simplest approach sets a fixed silence threshold: 800 milliseconds of silence after speech fires the endpointing event. This is too aggressive for thoughtful speakers with complex requests — they get cut off mid-sentence — and too conservative for fast-paced transactional exchanges, where the 800ms pause adds perceivable lag to every turn. Prosody-aware endpointing improves on this by analyzing pitch and duration patterns that signal sentence completion. A falling pitch contour at the end of a clause is a reliable signal that a speaker has finished a thought; rising pitch typically indicates continuation. Most production systems use a tunable silence threshold of 500–700ms with prosodic correction. Getting endpointing wrong is audible and damages caller trust faster than almost any other failure in the pipeline.
Acoustic Processing Happens Before the Words Matter
Three signal processing stages run on incoming audio before any transcription begins. They do not add intelligence. Their job is to deliver audio clean enough that the intelligence downstream can do its work accurately.
Acoustic echo cancellation. When the agent speaks — playing TTS audio back to the caller — that audio exists in the caller’s environment. If the caller’s microphone picks it up (and it will, on speakerphone or any non-headset setup), it enters the pipeline alongside the caller’s voice. The speech recognition system then sees both the caller and an echo of what the agent just said. AEC cancels that echo before the audio reaches the transcription stage. The mechanism is conceptually similar to noise-canceling headphones, but working in reverse: instead of sampling ambient sound and canceling it before it reaches your ears, AEC samples the agent’s own audio output and subtracts it from the microphone signal before the audio reaches transcription. It works by maintaining a reference signal — a clean copy of the audio the agent is playing — and using adaptive filtering to estimate how that reference signal will appear in the microphone input after traveling through the room’s acoustic environment, then subtracting that estimate from the microphone signal. The challenge is that the acoustic environment changes continuously as the caller moves or adjusts their setup; the filter has to track those changes in real time. In good conditions, AEC achieves 20–40 dB of echo suppression. On speakerphone in reverberant rooms, suppression is lower. WebRTC’s AEC3 implementation, updated in 2024, handles these conditions better than its predecessors and ships with most WebRTC stacks.
Noise suppression. Background noise degrades speech recognition in ways that are nonlinear. A constant noise floor is less damaging than intermittent noise events at the same average level, because ASR models trained on clean speech handle predictable background differently from unpredictable interference. ML-based noise suppression — the class of approaches that RNNoise pioneered, now extended by larger models from providers like Krisp and RTX Voice — uses a trained model to separate speech from non-speech frequencies in real time. Modern models achieve 8–10 dB of SNR improvement in typical environments. In high-noise settings like open call centers or callers outdoors, suppression is less complete but still meaningfully improves transcription accuracy downstream.
Automatic gain control. Callers speak at wildly different volumes. A soft-spoken caller with the phone held away from their face and an enthusiastic caller with the phone pressed to their cheek present the transcription system with very different signal levels. AGC normalizes those levels to a target range, typically −20 to −15 dBFS, so the transcription model receives consistently leveled audio regardless of speaker characteristics. The failure mode is amplifying background noise when a caller goes quiet, which is why AGC runs after noise suppression rather than before it. The three stages run in order — AEC, then noise suppression, then AGC — and add approximately 50–150ms of cumulative latency to the pipeline. That latency is not optional; it happens before the audio is clean enough to transcribe accurately.
Speech Recognition Is Fast Enough Until the Line Is Bad
Speech-to-text transcription is the first stage in the pipeline that produces something the language model can read. How fast it does so, and how accurately, determines the quality of everything downstream.
For voice agents, streaming transcription is mandatory. Batch transcription — sending a complete audio recording and receiving a complete transcript — adds 5–30 seconds of latency that destroys conversational naturalness. Streaming transcription processes audio as it arrives, producing partial results continuously and a final result when endpointing fires. The partial results are not discarded: they allow the language model to begin forming a response before the transcript is complete, which is one of the primary techniques for keeping end-to-end latency below perceptible thresholds.
A streaming transcription session works like this. Audio frames arrive at the STT service in chunks of 100–200ms over a WebSocket or gRPC connection. As each chunk arrives, the service updates its internal hypothesis about what was said and returns a partial result — the current best guess at the transcript. “I need to reschedule” becomes “I need to reschedule my appointment” as more audio arrives. When the VAD triggers endpointing, the service finalizes its hypothesis, returns a final result, and closes that turn’s session. The time from the last word spoken to the final result appearing is typically 500ms–1.5 seconds, depending on provider and network.
Provider differences matter because they trade latency against accuracy against cost. Deepgram’s Nova-2 model achieves approximately 5.9% word error rate on clean speech and returns final transcripts in under 500ms, making it the most common choice for latency-sensitive deployments. Google’s Chirp model reaches roughly 4.0% WER with comparable streaming latency. AssemblyAI’s conformer-based model achieves 4.9% WER with better handling of accented speech, at the cost of slightly higher latency. OpenAI’s Whisper model is the most accurate at approximately 4.0% WER but is batch-only — it cannot stream — which makes it unsuitable for real-time conversation. Cartesia’s Sonic STT prioritizes speed above all, targeting sub-200ms returns at the cost of a point or two of accuracy, and is the choice when latency budget is critically constrained.
These benchmarks are measured on clean speech. Telephony changes the picture significantly. G.711 audio at 8kHz, run through a real phone network with typical packet loss and caller-side noise, raises effective WER to 12–18% even with best-in-class models. A 15% WER means roughly one error in every seven words. Errors on function words and filler are usually recoverable — the language model can infer intent from context. Errors on names, numbers, medical terms, account identifiers, and domain-specific vocabulary are not recoverable the same way. Domain adaptation — fine-tuning a model on vocabulary and speech patterns specific to the deployment context — reduces WER by 20–40% in specialized use cases and is not optional for medical, legal, or financial deployments where term accuracy has downstream consequences.
On-device STT is a growing alternative to cloud-based streaming. Smaller Whisper variants in the 100–300MB range can run on server infrastructure with latency competitive to cloud providers while eliminating the network round-trip and the privacy implications of routing caller audio to a third-party service. As of 2025, on-device inference is a realistic option for organizations with sufficient compute or strong data-residency requirements; the quality gap relative to the largest cloud models has narrowed substantially.
The Brain Has a Latency Budget
The language model is the component that gives a voice agent its ability to understand what a caller said, reason about what to do next, and generate a natural-sounding response. It is also the component most at risk of making the agent useless — either because the model is too slow, the prompt is poorly designed, or the model is too large for the latency budget the conversation requires.
Voice agents run different prompts than chatbots. A chatbot prompt can include extensive background, tolerate long responses, and assume the user is reading text on a screen. A voice prompt operates under hard constraints. Responses must be short — under 100 tokens in most exchanges, because spoken language at 150 words per minute does not tolerate a 300-word response the way text does. Responses must be in speech-natural language: no markdown, no bulleted lists, no headers, no parenthetical asides. The agent should say “I can help you reschedule. What date works for you?” — not “Rescheduling Options: Here are the steps I can take.”
A well-structured voice agent system prompt contains a role definition and persona; explicit behavioral rules covering response length, format, and topic scope; tool definitions with parameter schemas for every action the agent can take; a short set of few-shot examples showing correct dialogue patterns; and dynamic context injected at runtime — the caller’s account information, their appointment history, any outstanding issues. That dynamic context is retrieved by the backend before the conversation starts, or fetched via a tool call early in the session. Putting it in the system prompt at invocation time is cleaner than mid-conversation retrieval for information the agent will almost certainly need.
Model selection for voice is primarily a latency decision. A caller pausing while waiting for a response accepts that pause once before noticing it, twice before doubting the call quality, and three times before wondering if the agent is broken. The latency difference between model classes is significant: Gemini Flash generates a short voice response in 200–300 milliseconds; GPT-4o operates at 1000–1500 milliseconds; Claude Opus 4 is closer to 2000–3000 milliseconds for a typical turn. For the transactional exchanges that make up most of a voice agent’s workload — booking appointments, checking account status, answering policy questions — Gemini Flash and GPT-4o mini are nearly indistinguishable in quality from larger models when the system prompt constrains the response domain. The 5–10% quality advantage of a larger model matters significantly less when the task is confirming a reservation than when the task is drafting a legal analysis.
Temperature tuning is different for voice than for text. A temperature of 0.7, standard for creative generation in chat contexts, produces unnecessary variation in voice deployments where consistency builds caller trust. Most production voice agents run at 0.3–0.5. At the lower end, the model is deterministic enough that common questions reliably produce similar responses, which also simplifies evaluation: the same test input produces the same output, making regressions detectable.
Retrieval-augmented generation addresses the gap between what the model knows and what the agent needs to know for a specific caller. Rather than embedding an entire product catalog, policy library, or knowledge base into the system prompt — a token-intensive approach that pushes against context limits and slows inference — RAG retrieves the relevant fragment at query time and injects it into the prompt. The agent fetches the three relevant facts about the caller’s account rather than loading a full account history. RAG also reduces hallucination on factual queries: the model generates from a retrieved fact rather than from its training weights, which do not reflect the current state of a caller’s account or a product’s current pricing.
Guardrails Run Before the Brain and After It
Guardrails are the controls that keep a voice agent within its intended scope. A customer service agent for a software company should not offer medical advice. An appointment booking agent should not make commitments beyond what the scheduling system can fulfill. An agent deployed in a regulated industry should not make statements that could constitute financial guidance. Guardrails enforce those constraints at the boundaries of the language model, not inside it.
Three layers run in a production deployment. The first runs before the model sees the caller’s input. An input classifier — a small, fast model or a rule-based system — evaluates the incoming transcribed text for content that should not reach the language model: attempts to override the agent’s instructions, explicit harmful content, or queries so far outside the agent’s scope that processing them would produce noise rather than value. This classifier needs to run in under 50ms and must have a low false positive rate. In voice, rejecting a valid caller query means dead air or a confused response — the cost of a false positive is different than in text, where a graceful error message is easy to display.
The second layer is prompt architecture rather than a classifier. Never pipe raw user input directly into the system prompt via string concatenation. A caller who says “ignore your previous instructions” should have that text land in the user message role, where the model’s training distinguishes user requests from operator instructions, not embedded in the system prompt where it would carry the weight of operational guidance. This closes the most common prompt injection path without adding latency, because it is an architectural choice rather than a runtime check.
The third layer runs on the model’s output before it reaches text-to-speech. An output validator checks the response for content that should not be spoken: hallucinated policy commitments, statements outside the agent’s authorized scope, language that could create legal exposure. In regulated industries, output validation is mandatory. Guardrails AI provides a pipeline framework for attaching validators to inputs and outputs with a library of pre-built validators for common content categories. Llama Guard, Meta’s open-source safety classifier, identifies harmful content across standard safety taxonomies and runs locally for organizations with data-residency requirements. Custom classifiers — fine-tuned on domain-specific content the deployment must reject — are necessary when the definition of out-of-scope is particular to the business context.
The total latency budget for all guardrail layers combined is approximately 100ms. This constrains choices: larger classifiers with higher capability ceilings may exceed that budget and force a tradeoff with end-to-end latency. The practical result is that voice guardrails favor smaller, faster classifiers and rule-based systems over large models, which inverts the approach that would be taken in a non-latency-constrained context.
Tool Calls Break Conversations Without Async Execution
Most voice agent tasks require doing something beyond generating text. An appointment booking agent must check available calendar slots and create a record. An account inquiry agent must retrieve a live balance. An order tracking agent must query a fulfillment database. These operations happen through tool calls — structured function invocations the language model issues mid-conversation when it needs external data or wants to take a real-world action.
In a synchronous execution model, the tool call occupies the full pipeline. The caller stops speaking. The agent transcribes. The model reads the transcript, determines it needs to call the calendar API, issues the call, waits for the response, reads the result, and generates a reply. The wait is silent. From the caller’s perspective, the agent stopped responding for two to four seconds. That silence reads as technical failure, not as the agent “thinking.” Callers who encounter it repeatedly start speaking into the silence to check if the call dropped.
Async tool execution is the production pattern. When the model determines a tool call is needed, it generates a bridge phrase — “let me check that for you” or “give me just a moment” — and starts the tool call in the background. The bridge phrase goes to TTS and plays to the caller immediately. When the tool returns its result, the model receives it and generates the actual response. The caller hears a natural acknowledgment followed by the answer. The technical pause that was two to four seconds of dead air becomes 0.5 to 1 second of audio before results start arriving. Libraries and orchestration platforms like Vapi, Daily.co, and LiveKit’s agent toolkit support this pattern natively; implementing it from scratch requires careful thread management to avoid race conditions between the in-flight tool call and incoming caller audio.
Parallel tool execution applies when a single caller request requires multiple independent external queries. “Can you pull up my appointment schedule and confirm my insurance is still on file?” requires two API calls, neither of which depends on the other’s output. Running them sequentially doubles the wait. Running them simultaneously — issuing both calls at the same time and collecting results as they arrive — holds total wait time to the duration of the slower call, not the sum of both.
Tool failures need graceful handling that does not break the conversation flow. If a calendar API returns a 503, the agent cannot expose a raw error to the caller. The fallback response — “I’m having some trouble accessing the scheduling system right now. Let me connect you with someone who can help directly” — must be pre-defined, available without a model round-trip, and fast enough to play before the caller notices anything went wrong. Write operations introduce an additional concern: if the agent tells the caller “you’re booked for Thursday at 2pm” before the booking tool call confirms success, and the call then fails, the agent has made a commitment it cannot fulfill. The correct pattern is to confirm intent before issuing the write, then confirm completion after the write returns success before playing the confirmation to the caller.
Complex Tasks Need More Than One Agent
A single language model agent operating from a single system prompt can handle a bounded set of conversational tasks reliably. As task scope expands — more domains, more tool integrations, more decision branches — the single-agent architecture strains in specific ways. System prompts grow unwieldy. The model begins confusing instructions from different sections. Edge cases multiply. Reliability degrades not catastrophically but gradually, in ways that are hard to attribute to a specific cause in production logs.
Multi-agent orchestration addresses this by decomposing voice agent tasks into a hierarchy. The structure mirrors how a well-run call center operates: a general reception layer takes every incoming call, classifies what the caller needs, and routes to the appropriate specialist. The specialist handles the actual conversation without the caller feeling like they were transferred. An orchestrator agent handles the call from start to finish: it listens, classifies intent, and routes to the appropriate specialist. Specialist agents focus on a single domain — booking appointments, retrieving account data, processing payments, troubleshooting a technical issue. Each specialist has a compact, focused system prompt and a narrow set of tools. Focused prompts produce more reliable behavior than prompts trying to cover every case; a booking specialist that knows only scheduling has fewer opportunities to confuse scheduling with billing than a single agent that handles both.
Intent classification at the orchestrator level is typically handled by a small, fast classifier rather than the full conversational model. The orchestrator listens to the caller’s first statement, classifies the request, and routes to the appropriate specialist. The specialist receives a briefing from the orchestrator — the conversation history, the caller’s account context, the classified intent — and continues the conversation from there. The caller never hears a transfer. They do not encounter “let me connect you to our scheduling department.” From the caller’s perspective, the same agent is answering throughout. Maintaining this illusion requires that all agents in the system share a consistent voice and persona, and that the briefing document passed at handoff is complete enough that the receiving agent does not ask questions the caller already answered.
State management in a multi-agent system requires a clear ownership model. The orchestrator owns the session state: caller identity, conversation history, confirmed commitments made during the call. Specialist agents receive a read-only copy of the relevant portion of that state. They do not write back to session state directly; they pass results to the orchestrator, which decides what to persist. This prevents the class of bugs where two agents simultaneously modify the same state object and produce inconsistent behavior. In practice, most voice agent deployments stay under three agent hops per call. Each hop adds 200–500ms of routing and context-passing latency. More than three hops makes the conversation feel disjointed, and it usually indicates the agent’s task scope is too broad rather than that more orchestration is needed.
Text-to-Speech Starts Before the Model Is Done
Text-to-speech is the pipeline’s last major conversion stage: text in, audio out. What the caller hears is not the language model’s output — it is an audio rendering of that output, and the quality of that rendering determines whether the agent sounds like a professional service or an early automated phone tree.
Modern TTS systems are built on neural architectures. The processing chain runs roughly: text input is converted to a phoneme sequence; a duration prediction model determines how long each phoneme should last; a mel-spectrogram generation model produces a time-frequency representation of the intended audio; a neural vocoder converts that spectrogram into a PCM waveform. Each step has been a target of rapid model improvement over the past three years. The duration model is where prosody — the rhythm and timing that makes speech sound natural rather than mechanical — is primarily determined. Models that handle duration poorly produce speech where every sentence sounds equally paced, regardless of its content.
Speech quality is measured by Mean Opinion Score, a listener rating from 1 to 5. Human speech scores approximately 4.9. Modern neural TTS from leading providers scores 4.3–4.8 — perceptibly different from human speech to careful listeners but indistinguishable to most callers in a conversation context, where attention is on content rather than delivery. ElevenLabs occupies the quality end of the market, with extensive voice options, fine-grained control over speaking style, and voice cloning capability, at a cost premium. Cartesia’s Sonic TTS is optimized for latency, targeting sub-120ms first-chunk delivery, making it the choice when latency budget is the primary constraint. OpenAI’s TTS offering trades some quality for tight integration with OpenAI’s API ecosystem. Azure and Google TTS offer strong multilingual coverage and enterprise-grade reliability at competitive pricing.
Voice cloning is available through most providers and is now standard in enterprise voice agent deployments where the agent should embody a specific branded persona. The basic technique is speaker embedding: extract a speaker’s voice characteristics from 30 seconds to two minutes of sample audio, encode those characteristics as a learned vector, and condition the TTS model on that vector during generation. The result sounds like the reference speaker with reasonable fidelity. Fine-tuning — training lightweight adapter layers on a specific speaker’s audio — produces higher quality at the cost of more source material (typically five to thirty minutes of clean recording) and additional preparation time. Organizations using voice cloning in production should maintain documented consent records for any voice samples used. Cloning someone’s voice without consent is subject to increasing legal scrutiny in both US and EU jurisdictions.
The most impactful latency optimization in TTS for voice agents is sentence-level streaming, also called dual-streaming. In a naive implementation, the pipeline waits for the language model to complete its full response before passing any text to TTS, and waits for TTS to finish generating audio before sending any audio to the caller. Both waits are unnecessary. The moment the language model produces a complete first sentence, that sentence can go to TTS. The moment TTS produces the first audio chunk, that audio can start playing to the caller. By the time the language model has finished generating its full response, much of the audio is already playing. This technique reduces perceived end-to-end latency by 1–2 seconds and is now standard in production deployments.
Full-Duplex Is What Natural Conversation Actually Requires
The most visible design choice in voice agent architecture is conversation mode: half-duplex or full-duplex. The difference is whether caller and agent can speak simultaneously.
In half-duplex mode, one party speaks while the other listens, and turns alternate cleanly. The caller speaks; the agent transcribes, processes, and responds; then waits for the caller to speak again. No simultaneous audio, no interruptions. This is the walkie-talkie model. It is common in early voice agent deployments because it is architecturally simpler: without simultaneous audio, echo cancellation is trivial, barge-in detection is unnecessary, and the pipeline can process each turn in isolation. The caller experience is functional but unnatural — it feels like a phone tree with better language understanding, not like a conversation.
Full-duplex allows both parties to speak simultaneously and requires the agent to detect when the caller starts speaking while the agent is still responding, stop the current response, and re-engage. This is how human conversation actually works. A caller who says “actually, never mind” while the agent is mid-sentence expects the agent to stop, not to finish its thought and then acknowledge the interruption 15 seconds later.
The technical challenge in full-duplex is barge-in detection: identifying when caller audio represents a genuine interruption rather than TTS audio bleeding back through the microphone or background noise. A naive energy detector will trigger on the agent’s own TTS output playing in the caller’s environment. Reliable barge-in detection requires AEC to be running effectively — the agent’s TTS output must be subtracted from the microphone signal before barge-in classification runs. After AEC, a classifier trained to distinguish residual caller speech from echo artifacts decides whether the incoming signal represents a real interruption. When barge-in is confirmed, the pipeline cancels the in-flight TTS, flushes the audio buffer, and restarts the listen — transcribe — respond cycle. Latency from barge-in detection to first response must stay under 200ms or the agent’s acknowledgment of the interruption feels delayed.
The transport protocol shapes which mode is practical. WebRTC provides a bidirectional audio channel purpose-built for simultaneous two-way communication and is the natural foundation for full-duplex agents. Traditional PSTN telephony as it flows through SIP integrations is more naturally half-duplex in practice — not because the protocol prohibits full-duplex, but because echo path characteristics over phone networks make barge-in detection less reliable. Web-based voice agents running in a browser over WebRTC achieve full-duplex more reliably than telephony agents for this reason. Most enterprise telephony deployments today use a hybrid approach: half-duplex as the default, with barge-in enabled for specific signals like extended silence after agent speech or specific interrupt words.
WebSocket is the transport alternative for cases where WebRTC’s peer-to-peer connection model is unnecessary or unwanted. WebSocket operates over TCP and traverses firewalls and NATs more easily than WebRTC’s UDP-based connections. The tradeoff is latency: TCP’s congestion control and retransmission behavior adds variability to audio delivery that UDP avoids. For streaming TTS audio from a server to a browser in one direction — where reliability matters and strict real-time is less critical — WebSocket is common. For bidirectional real-time audio where latency variance creates audible artifacts, WebRTC is preferred.
Outbound Agents Call People; the Architecture Is Different
Outbound calling agents initiate calls rather than receiving them. The pipeline components are identical to inbound agents — the same STT, language model, TTS, and guardrail stack — but the call setup, answer handling, and compliance architecture are fundamentally different.
The call setup sequence for outbound adds 5–35 seconds of overhead before any conversation begins. A SIP INVITE is sent to the carrier with the destination number. The carrier routes through PSTN. The destination phone rings. Someone answers — or does not. If no one answers within 30 seconds, the call is treated as a no-answer and logged for retry scheduling. If someone does answer, the agent has the first 500–1000 milliseconds of incoming audio to determine what it is connected to. A human pickup sounds different from a voicemail greeting, and an IVR system sounds different from both. Answer detection classifiers trained on these distinctions achieve 85–95% accuracy in production. Calling into an answering machine when a human was expected — or playing a greeting when a voicemail system is waiting for a tone — is a common failure mode and one of the clearest signals of poor answer detection in deployment logs.
Scheduling and rate limiting are operational concerns that do not exist for inbound agents. SIP trunks typically support around 100 simultaneous calls per trunk at most carriers. Beyond that, calls queue or fail. Outbound campaigns covering large contact lists must spread calls across time windows to stay within trunk capacity, respect time zone constraints, and avoid creating the kind of concentrated call spike that triggers carrier spam detection. TCPA compliance is mandatory for US deployments. The Telephone Consumer Protection Act requires prior express written consent for automated calls using artificial or prerecorded voices, and a TTS-powered voice agent qualifies. Consent must be documented and stored against each recipient’s phone number before dialing. Every outbound call must offer an opt-out mechanism. Calling before 8am or after 9pm local time violates the statute. Violations carry statutory damages of $500–$1,500 per call, and class action exposure for large campaigns is real. For EU deployments, GDPR imposes similar consent requirements with higher potential penalties and stricter limitations on how consent records can be retained.
Long Conversations Break in Predictable Ways
Every voice agent conversation has a token budget. The language model’s context window is finite, and each conversation turn consumes some of it. The accumulation is predictable but often underestimated by developers who prototype on short test conversations. At a typical rate of 200 tokens per conversational exchange, a model with a 128K context window — standard for GPT-4o mini — exhausts that window in approximately 60 exchanges. For a fast-paced customer service call with rapid back-and-forth, that limit arrives sooner than it sounds.
Context window overflow does not cause an error. It causes degradation. As the conversation history approaches the context limit, the model loses access to early turns. A caller who mentioned their account number ten minutes ago finds the agent asking for it again. References to earlier parts of the conversation produce confused responses. The agent begins to contradict what it said earlier, unable to access the turn where it made the commitment. The failure is gradual, which makes it harder to diagnose than a hard error would be.
The standard mitigation is windowed context management with periodic summarization. When the conversation history reaches approximately 80% of the context window, the pipeline summarizes the oldest portion of the conversation and replaces those turns with a compact representation. Extractive summarization keeps key facts from old turns — the caller’s name, the issue they described, commitments made — and discards the conversational filler. Abstractive summarization uses a small, fast model to generate a prose summary of what was discussed. The summary, plus the most recent 10–15 turns, becomes the new context, allowing indefinitely long conversations at the cost of losing verbatim access to early content.
Hallucination drift is a distinct phenomenon that compounds context overflow. Over the course of a long conversation, the model’s hallucination rate increases gradually. Research documents the increase at approximately 0.2–0.5 additional percentage points per 100 turns, against a baseline hallucination rate that depends on model and domain. The mechanism is compounding: an early error in the conversation — a misheard word, a transcription error, an incorrect assumption the model did not challenge — gets incorporated into subsequent responses. Later responses build on that error. By turn 80 or 100, the agent may be operating on a factual premise that entered the conversation as a transcription error in turn 10 and has been silently reinforced ever since.
Mitigations include tool-verified responses for all factual claims — the agent calls the authoritative database and reads back what it finds rather than generating numbers from its weights — and periodic explicit confirmations with the caller. Asking “just to confirm: your account ends in 4821 and you’re calling about the March 12th charge?” at natural breaks in the conversation re-verifies facts and resets the conversational state to a known-good checkpoint. It also gives the caller a natural opportunity to correct errors before they compound further.
Persistent memory across sessions is a related problem. By default, each call starts fresh with no knowledge of prior interactions. A caller who has called three times this week about the same unresolved billing issue should not have to re-explain their situation from the beginning. Production deployments handle this by generating a structured summary at the end of each session — caller identity, topics discussed, outcomes, commitments — and storing it against the caller’s profile. On the next session, the summary loads as initial context, and the agent opens with awareness of what was discussed rather than starting from zero. The summary must be compact enough to fit within the context window alongside the new conversation, which constrains how much historical detail it can carry.
Evaluation Is How You Know Any of This Is Working
A voice agent that sounds good in a demo and performs well in the developer’s tests may be deeply broken in ways that only production data reveals. Evaluation is what closes that gap — the discipline of measuring what the agent is actually doing, at scale, in conditions the development environment did not reproduce.
The foundational metric is task completion rate: the percentage of conversations in which the caller’s intent was successfully fulfilled. A caller who wants to reschedule an appointment and reaches the end of the call with the appointment rescheduled is a completion. A caller who gives up and asks for a human — or hangs up — is not. Measuring task completion requires knowing what the caller intended, which is not always explicit in the transcript. The most scalable approach is LLM-as-judge evaluation: a separate model reads each conversation transcript and determines whether the intent was fulfilled, using a rubric calibrated on human-labeled examples. Well-designed voice agents in 2025 achieve 75–85% task completion rate in production; first-deployment agents without significant iteration typically run 60–70%.
Latency metrics operate at two levels. End-to-end latency — from the moment the caller stops speaking to the moment audio starts playing — should stay below 1.5 seconds for most exchanges and below 2 seconds in all but exceptional cases. Time to first audio byte is a more sensitive measure of perceived responsiveness, because audio that starts quickly and continues smoothly feels faster than the same total latency delivered as silence followed by a burst. Both should be measured in production, not just in testing environments, because network conditions, server load, and real caller audio quality produce distributions that controlled tests do not replicate.
Word error rate in production is measured differently than in benchmarks. Benchmark WER is computed on clean speech from specific accent distributions. Production WER depends on actual callers, actual phone equipment, actual background noise, and actual topic distributions. A monitoring system that periodically samples live transcripts, runs them through a higher-accuracy reference transcription, and computes the delta gives a continuous read on STT degradation. STT quality degradation is rarely signaled by an error in the application logs; it just produces quietly wrong transcripts that the model processes with full confidence.
Hallucination rate measurement is the hardest evaluation problem because hallucinations are only identifiable when there is ground truth to check against. For factual claims that are tool-verifiable — account balances, appointment dates, policy terms — the pipeline can check its own output by re-verifying the claim against the authoritative source before the TTS plays it. For claims generated from the model’s own knowledge, human evaluation on sampled conversations is the only reliable method. Sample size matters: a 1% hallucination rate in a million-call deployment is 10,000 callers receiving incorrect information per month.
Escalation rate — the percentage of calls where the caller requests or is routed to a human agent — is a proxy for capability gaps. A 15–20% escalation rate is normal and expected; some callers will always prefer human contact regardless of agent quality. An escalation rate above 25–30% typically indicates either that the agent’s task scope is too narrow or that it is failing to resolve requests it should handle. Tracking escalation reasons — caller explicitly requested human, agent offered escalation after tool failure, agent hit a policy boundary — enables targeted improvement rather than undifferentiated tuning.
Automated test harnesses run synthetic conversations through the full pipeline before deployments and configuration changes. They generate representative scenarios covering common caller intents, edge cases, guardrail test cases, and tool failure conditions, drive those scenarios through the agent with synthesized speech, and evaluate outputs against expected behavior. This does not replace production evaluation, but it catches regressions in guardrail handling, tool call correctness, and response accuracy before they reach callers. The combination of automated pre-deployment testing and continuous production monitoring — with alerting thresholds for WER degradation, tool failure spikes, escalation rate changes, and TTS quality drops — is what separates voice agents that improve over time from agents that silently degrade.
What Each Component Costs per Minute
Every component in the pipeline runs on metered infrastructure, and the costs compound per minute of active conversation. Understanding the cost structure before going to scale prevents the billing-cycle surprise that catches most first-time voice agent builders, because the economics of a small prototype do not predict the economics of ten thousand calls per day.
The pipeline has four billable layers for a cloud-based deployment. Telephony or transport: inbound call minutes from the carrier or platform. Twilio charges approximately $0.0085 per minute for inbound calls; outbound runs around $0.013 per minute. Managed platforms like Vapi bundle telephony into a per-minute rate that covers the full stack. Speech-to-text: billed per minute of audio transcribed. Deepgram Nova-2 runs approximately $0.0043 per minute; Google Chirp runs $0.006 per minute; AssemblyAI runs higher at around $0.015 per minute for streaming. These scale linearly with call duration. Language model: billed per token processed. Cost scales with conversation length and context window usage. A typical ten-minute call with thirty exchanges generates roughly 60,000 tokens across all turns. At Gemini Flash pricing, that call costs under a cent in LLM inference; at GPT-4o pricing, the same call costs around $0.15. Text-to-speech: billed per character of text converted to audio. A voice agent speaking at a natural conversational pace generates approximately 700–900 characters per minute. Cartesia Sonic runs approximately $0.065 per thousand characters; OpenAI TTS runs $0.015 per thousand characters; ElevenLabs is plan-priced with effective per-character rates that typically run higher than the alternatives.
The total cost for a minute of conversation ranges from roughly $0.02 on a cost-optimized stack (Deepgram + Gemini Flash + Cartesia + Twilio) to $0.25–0.40 on a premium stack (Google Chirp + GPT-4o + ElevenLabs + Twilio). A ten-minute call on the premium stack costs $2.50–4.00 in inference costs alone, before telephony or platform overhead. At one thousand calls per day, the difference between the two stacks is approximately $200 versus $2,000 per day in API costs.
The model choice — LLM and TTS specifically — is the biggest cost lever. Most voice agent quality problems come from system prompt design, guardrail configuration, and tool reliability rather than from model capability. The quality gap between Gemini Flash and GPT-4o for a well-scoped appointment booking agent is small. The cost gap is not. Upgrading the model before optimizing the prompt and the pipeline is the most expensive way to address the wrong problem. Start with the fastest, cheapest stack that meets the quality bar; upgrade specific components only when measured evidence shows a specific component is the limiting factor.
A voice agent is a pipeline, and every component in that pipeline has a failure mode that becomes visible at scale. The teams shipping voice agents that hold up under real conversational load are not winning on any single component. They win on integration discipline: understanding what each stage contributes to the latency budget, where cost concentrates, and how each stage degrades when the inputs it depends on get worse. The stack is learnable. What takes time is learning it through calls, not through documentation.
Sources
- Deepgram, Nova-2 Speech Recognition Model Technical Brief, 2025
- Google DeepMind, Chirp: Universal Speech Model Technical Report, 2024
- Silero Team, Silero VAD: Pre-trained Enterprise-Grade Voice Activity Detector, GitHub, 2024
- Meta AI, Llama Guard: LLM-Based Input-Output Safeguard for Human-AI Conversations, arXiv:2312.06674, 2024
- WebRTC Working Group, WebRTC 1.0: Real-Time Communication Between Browsers, W3C, 2023
- ElevenLabs, Eleven Turbo v2.5: Low-Latency Text-to-Speech Technical Documentation, 2025
- Cartesia, Sonic: A Model for Real-Time Speech Synthesis, Technical Documentation, 2025
- Twilio, Media Streams Developer Documentation and Pricing, 2025
- LiveKit, Voice Agent Framework Documentation, 2025
- Vapi, Voice AI Infrastructure Technical Documentation, 2025
- Federal Communications Commission, TCPA Rules and Regulations, 47 U.S.C. § 227, 2025
- AssemblyAI, Conformer-2 Speech Recognition Technical Overview, 2025
