Every major CCaaS and conversational AI vendor now claims to offer “agentic” customer experiences. The demos look impressive. A voice agent that reasons, retrieves, decides, and resolves—all in real time. Then you deploy it. And the magic disappears.
Not because the AI isn’t smart enough. Because the architecture underneath it is held together with API calls, third-party services, and hope.
The Uncomfortable Truth About Composite Stacks
Here’s the claim: if your voice AI stack is assembled from separate best-of-breed components—ASR from one vendor, NLU from another, TTS from a third, dialogue management from a fourth, LLM reasoning bolted on top—you will never deliver a truly agentic customer experience. Not “it’ll be harder.” Never.
The physics of real-time voice conversation don’t bend to accommodate architectural convenience. Three factors guarantee this: latency compounding, noise profile mismatches, and concurrency ceilings. Each one alone degrades the experience. Together, they kill it.
Latency: Death by a Thousand Milliseconds
Human conversation tolerates about 300–400 milliseconds of silence before it feels unnatural. That’s not a design guideline. That’s psycholinguistics. Exceed it, and your caller starts talking over the agent, repeating themselves, or hanging up.
In a fully integrated stack, the round-trip from speech input to spoken response can be optimized end-to-end. Every handoff is internal. Every signal passes through shared memory, not serialized JSON over HTTPS.
In a composite stack, each component boundary introduces latency:
- Audio capture to ASR: network hop + buffering
- ASR to NLU: serialization, API call, deserialization
- NLU to dialogue manager: another API call
- Dialogue manager to LLM (if reasoning is needed): yet another
- LLM response back through dialogue to TTS: reverse the chain
- TTS audio back to caller: final network hop
Each boundary adds 50–150ms under ideal conditions. With six boundaries, you’re looking at 300–900ms of architectural overhead before you’ve done any actual processing. Add the processing time itself—ASR decoding, intent classification, LLM inference—and you’re well past the threshold where conversation feels natural.
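To put numbers on that arithmetic, here is a back-of-envelope sketch in Python. Every figure in it is an illustrative assumption drawn from the ranges above (six boundaries at 50–150ms each, plus a nominal processing budget), not a measurement of any specific stack:

```python
# Back-of-envelope latency budget for one conversational turn.
# All figures are illustrative assumptions, not vendor measurements.

BOUNDARY_MS = (50, 150)   # per-hop overhead under ideal conditions
NUM_BOUNDARIES = 6        # capture->ASR, ASR->NLU, NLU->DM, DM->LLM, LLM->TTS, TTS->caller
PROCESSING_MS = 250       # assumed ASR decode + intent classification + LLM inference
THRESHOLD_MS = 400        # upper end of the natural-pause tolerance

def turn_latency(num_boundaries: int, per_hop_ms: int, processing_ms: int) -> int:
    """Total silence the caller hears: architectural overhead plus real work."""
    return num_boundaries * per_hop_ms + processing_ms

for label, per_hop in (("best case", BOUNDARY_MS[0]), ("worst case", BOUNDARY_MS[1])):
    total = turn_latency(NUM_BOUNDARIES, per_hop, PROCESSING_MS)
    verdict = "OK" if total <= THRESHOLD_MS else f"{total - THRESHOLD_MS}ms over budget"
    print(f"composite, {label}: {total}ms -> {verdict}")

# An integrated stack pays only the processing cost plus one network leg
# to and from the caller (assumed ~50ms here).
print(f"integrated: {PROCESSING_MS + 50}ms")
```

Even the best case lands 150ms past the 400ms tolerance, before any real-world network jitter enters the picture.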
This isn’t a tuning problem. It’s structural. You can’t optimize away the speed of light or the overhead of HTTP.
In an integrated stack, these boundaries don’t exist. Components share process space, memory, and context. The result: response times that stay within conversational norms even under complex reasoning loads.
Noise Profiles and VAD: The Misalignment Nobody Talks About
Voice Activity Detection (VAD) is deceptively simple in concept: determine when the caller is speaking and when they’ve stopped. In practice, it’s one of the hardest problems in production voice AI.
Why? Because VAD doesn’t operate in a vacuum. It needs to understand the acoustic environment, the telephony channel characteristics, codec artifacts, background noise patterns, and the speech patterns of the caller. Getting this wrong means either cutting off the caller mid-sentence (false endpoint) or waiting too long after they stop (adding dead air).
When ASR and dialogue management come from different vendors, their noise models are trained on different data, tuned for different environments, and optimized for different objectives. The ASR’s VAD might be aggressive—optimized for quick endpoint detection to minimize processing. The dialogue manager expects complete utterances. The result: the ASR clips the last word of a sentence, the NLU misclassifies the intent, and the agent responds to something the caller didn’t say.
Or the reverse: the ASR’s VAD is conservative, holding the audio stream open to capture every syllable. Meanwhile, the dialogue manager’s barge-in detection interprets the silence as a turn completion and starts generating a response, just as the caller picks back up to finish their thought.
These aren’t edge cases. They’re the norm in real-world telephony. Background noise from a car, a call center, a kitchen. Hold music bleeding into the line. Callers who pause mid-sentence to think. Each of these scenarios requires coordinated handling between audio processing, speech recognition, and dialogue logic.
In a unified stack, these components share a single acoustic model and a single decision loop. VAD, endpointing, barge-in, and turn-taking are coordinated decisions, not distributed guesses. In a composite stack, they’re independent systems making independent judgments about the same audio stream—and disagreeing.
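A toy simulation makes the disagreement concrete. The frame pattern and hangover thresholds below are hypothetical, chosen only to show how two independently tuned endpointers reach opposite conclusions about the same pause:

```python
# Toy illustration of two independent endpointers judging the same audio.
# Frames are 20ms; 1 = speech energy detected, 0 = silence. The pattern
# below is a caller who pauses 300ms mid-sentence, then finishes.
frames = [1] * 20 + [0] * 15 + [1] * 10

def endpoint_frame(frames, hangover_frames):
    """Return the frame index at which this endpointer declares end-of-turn,
    i.e. the first time it sees `hangover_frames` consecutive silent frames."""
    silent = 0
    for i, f in enumerate(frames):
        silent = silent + 1 if f == 0 else 0
        if silent >= hangover_frames:
            return i
    return None  # never endpointed within this audio

# Hypothetical tunings: an aggressive ASR endpointer (200ms hangover)
# vs. a dialogue manager expecting longer pauses (600ms hangover).
asr_endpoint = endpoint_frame(frames, hangover_frames=10)  # 200ms
dm_endpoint = endpoint_frame(frames, hangover_frames=30)   # 600ms

print(f"ASR endpointer fires at frame {asr_endpoint} (mid-pause: clips the caller)")
print(f"DM endpointer result: {dm_endpoint} (never sees a turn boundary)")
```

The aggressive endpointer fires in the middle of the caller’s 300ms pause; the conservative one never declares a boundary at all. Each component then proceeds on its own version of reality.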
Maximum Concurrency: Where Stitched Stacks Break
Most voice AI demos run on single-digit concurrent sessions. Production runs on thousands. This is where composite architectures reveal their most dangerous weakness.
Every external API call in the chain is a concurrency bottleneck. Each vendor component has its own rate limits, scaling characteristics, and failure modes. Your maximum throughput isn’t determined by your best component—it’s determined by your worst.
- If your ASR vendor throttles at 500 concurrent streams, your entire system caps at 500—regardless of what your dialogue manager can handle.
- If your LLM provider’s inference endpoint has a 2-second p99 latency spike at high load, every conversation slows down.
- If your TTS service queues requests during peak hours, callers hear silence while waiting for audio generation.
Scaling a composite stack means scaling every component independently, negotiating capacity with every vendor, and praying they all scale at the same rate. They won’t.
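The arithmetic is simple enough to sketch. The capacities and p99 latencies below are hypothetical vendor figures, for illustration only:

```python
# Effective capacity of a chained stack is set by its tightest component.
# All capacities and latencies are hypothetical, for illustration only.
components = {
    "ASR": {"max_concurrent": 500,  "p99_ms": 120},
    "NLU": {"max_concurrent": 2000, "p99_ms": 40},
    "DM":  {"max_concurrent": 5000, "p99_ms": 30},
    "LLM": {"max_concurrent": 800,  "p99_ms": 2000},  # spikes under load
    "TTS": {"max_concurrent": 1500, "p99_ms": 90},
}

# Every call traverses every component, so the chain's ceiling is the min.
ceiling = min(c["max_concurrent"] for c in components.values())
bottleneck = min(components, key=lambda k: components[k]["max_concurrent"])

# The hops are serial, so worst-case latency is the sum of the p99s.
p99_chain = sum(c["p99_ms"] for c in components.values())

print(f"chain ceiling: {ceiling} sessions (bottleneck: {bottleneck})")
print(f"worst-case serial latency: {p99_chain}ms")
```

The worst-case figure assumes every hop hits its p99 on the same call, a pessimistic but instructive bound: one slow component drags the whole chain, and one tight rate limit caps it.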
An integrated stack scales as a unit. Capacity planning is singular. Bottleneck analysis is internal. When load increases, every component’s resource allocation adjusts in coordination, not in isolation.
This isn’t theoretical. At Omilia, we process billions of voice interactions across 130+ enterprise customers. We’ve seen what happens when a single external dependency in a chain can’t keep up. The entire experience degrades—not gracefully, but catastrophically. Conversations stall. Callers drop. CSAT collapses.
Why This Matters More for Agentic AI
Traditional IVR-style automation is relatively forgiving of architectural inefficiency. The interactions are short, scripted, and predictable. Latency of 800ms between menu prompts is annoying but survivable.
Agentic AI is fundamentally different. An agentic voice experience involves multi-turn reasoning, context retrieval, conditional logic, real-time decision-making, and natural conversational flow—sometimes across dozens of dialogue turns. The demands on the underlying stack multiply with every turn:
- Each reasoning step adds processing time—which compounds on top of architectural latency (see the sketch after this list).
- Context from earlier in the conversation must be instantly accessible—not retrieved across API boundaries.
- Turn-taking decisions become more complex as conversations deepen—requiring tighter coordination between VAD, dialogue management, and response generation.
- Error recovery—mishearing, misunderstanding, course-correcting—must happen in real time, not after a round-trip to three different services.
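Here is a quick sketch of that compounding, using assumed per-turn figures; the numbers are illustrative, the multiplication is the point:

```python
# How architectural overhead compounds across an agentic conversation.
# Per-turn figures are assumptions for illustration only.

TURNS = 20                    # a plausible multi-turn agentic resolution
COMPOSITE_OVERHEAD_MS = 400   # mid-range architectural overhead per turn
CONTEXT_FETCH_MS = 80         # per-turn context retrieval across an API boundary
INTEGRATED_OVERHEAD_MS = 50   # single network leg per turn; context is in-process

composite_dead_air = TURNS * (COMPOSITE_OVERHEAD_MS + CONTEXT_FETCH_MS)
integrated_dead_air = TURNS * INTEGRATED_OVERHEAD_MS

print(f"composite:  {composite_dead_air / 1000:.1f}s of cumulative dead air")
print(f"integrated: {integrated_dead_air / 1000:.1f}s of cumulative dead air")
```

Dead air that is tolerable once becomes nearly ten seconds of cumulative silence over a twenty-turn resolution.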
A composite stack that’s “good enough” for “press 1 for billing” collapses under the demands of a genuine agentic conversation. The tolerances are simply too tight.
The Implication for Buyers
If you’re evaluating conversational AI platforms for agentic customer experiences, the most important question isn’t “which LLM do you use?” or “what’s your NLU accuracy?” It’s this: how much of the stack do you own and control end-to-end?
Ask your vendor:
- Do your ASR, NLU, dialogue management, and TTS run in the same process space, or are they separate services?
- What is your end-to-end latency under load—not in a demo, but at 10,000 concurrent sessions?
- How do your VAD and endpointing models coordinate with your dialogue manager?
- What happens to response times when your LLM inference is under peak demand?
If the answers involve words like “partner,” “integration,” or “best-of-breed,” you’re looking at a composite stack. And that means you’re accepting a structural ceiling on how good your agentic experience can ever be.
The Stack Is the Product
The industry is converging on the idea that agentic AI is the future of customer experience. On that, everyone agrees. Where vendors diverge is on whether you can bolt agentic capabilities onto a fragmented architecture—or whether you need to have built the architecture for it from the ground up.
Latency compounds. Noise profiles conflict. Concurrency bottlenecks cascade. These aren’t problems you solve with better prompting or faster models. They’re problems you solve—or don’t—at the architectural level.
In voice AI, the stack isn’t a means to the product. The stack is the product. And if you don’t own it, you don’t control the experience.
About the Author
John Nikolaidis, Co-Founder
John Nikolaidis is Co-Founder and Managing Director of Omilia, a leading provider of conversational AI solutions for enterprise contact centers.


