Voice AI sub-800 ms: what we learned building an assistant that sounds human

A human waits, on average, 300 to 500 milliseconds between the end of your sentence and the start of their reply. Above one second, it gets uncomfortable. Above three, the conversation is broken — the person hangs up, or worse, talks over you and everything falls apart.

Most production voice AI assistants in 2026 run between 1.5 and 2.5 seconds end-to-end (Twilio Engineering, 2026). That's the "works without tuning" baseline. We benchmarked three third-party assistants earlier this year and saw it firsthand: users hang up and SLAs get blown. It's depressing.

We managed to drop to 620 ms median on a client pilot two months ago. Not because we found some magic component (there isn't one), but because we refused to run things in series when we could stream them in parallel.

Here's what we learned, with real numbers and real traps.

The problem isn't a slow component

Before architecture, let's lay out the absolute numbers, because individual components are already very fast in 2026.

Best-in-class STT today is AssemblyAI Universal-Streaming at 90 ms first-word, or Deepgram Nova-3 Flux around 150 ms (Future AGI benchmark 2026). TTS: ElevenLabs Flash at 75 ms first-byte, Deepgram Aura-2 at 100 ms. A fast LLM like Claude Haiku 4.5 or GPT-4o mini emits its first token around 400-500 ms.

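Sum these naively (wait for STT, then wait for LLM, then wait for TTS) and the first-output figures alone add up to roughly 565 to 750 ms. Add end-of-turn detection, typically the long pole of the latency (200 to 500 ms), and you mechanically land between about 0.8 and 1.2 seconds, before any network hop and before any stage waits for a complete result rather than a first token.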
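To make that arithmetic explicit, here is a small sketch that adds up the figures quoted above as ranges. The stage names and the `sumRanges` helper are ours, for illustration only; the numbers are the ones cited in this article, not new measurements, and they only cover first-output latency.

```javascript
// Latency ranges in ms, taken from the figures quoted in this article.
const stages = {
  endOfTurn: [200, 500], // end-of-turn detection
  stt: [90, 150],        // first word
  llm: [400, 500],       // first token
  tts: [75, 100],        // first byte
};

// Serial pipeline: every stage waits for the previous one, so the ranges add up.
function sumRanges(ranges) {
  return ranges.reduce(([lo, hi], [a, b]) => [lo + a, hi + b], [0, 0]);
}

const [minMs, maxMs] = sumRanges(Object.values(stages));
console.log(`serial budget: ${minMs} to ${maxMs} ms before network hops`);
```

Any stage that waits for a complete transcript or a complete LLM answer, rather than a first token, lands above this range.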

That's why everyone struggles to drop under one second without rethinking the flow.

The stream-everything architecture

Here's the architecture we deploy. It has an internal name ("stream-everything") and a principle: nobody waits for anyone.

   PSTN/SIP                         Orchestrator                       Output
   ──────                           ─────────────                      ──────

  Caller ──audio chunks──►  ┌──────────────────┐  ──audio chunks─►  ┌─────┐
                            │  Twilio Media    │                    │ STT │
                            │  Streams (WS)    │  ◄──partial txt──  │     │
                            └──────────────────┘                    └─────┘
                                     │                                  │
                                     │   audio chunks 20ms              │ partial transcript stream
                                     │                                  │
                                     ▼                                  ▼
                            ┌──────────────────┐  ──streamed tokens─►  ┌─────┐
                            │  Agent Loop      │                       │ LLM │
                            │  (state machine) │  ◄──tool calls──      │     │
                            └──────────────────┘                       └─────┘
                                     │                                   │
                                     │   token-by-token                  │ streaming response
                                     ▼                                   │
                            ┌──────────────────┐  ──audio chunks─────────┘
                            │  TTS Streaming   │  (ElevenLabs / Aura)
                            │  + jitter buffer │
                            └──────────────────┘
                                     │
                                     ▼
                                  Caller

Four operating principles behind it.

First principle: STT never stops

The agent should never wait for transcription to finish. Good providers such as Deepgram and AssemblyAI mark which segments are stable; in Deepgram's streaming results that marker is the is_final flag. As soon as a segment is stable, you forward it to your agent loop. When speech_final arrives (end-of-turn detected), you commit the turn and trigger the LLM.

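// Assumes the @deepgram/sdk v3+ client and an API key in the environment.
const { createClient } = require("@deepgram/sdk");
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// Open a live transcription socket; interim_results enables partial transcripts.
const dg = deepgram.listen.live({
  model: "nova-3-flux",
  endpointing: 300, // ms of silence before speech_final is raised
  interim_results: true,
});

dg.on("Results", (data) => {
  const alternative = data.channel.alternatives[0];
  // is_final and speech_final sit on the message itself, not on the alternative.
  if (data.is_final) {
    if (alternative.transcript) {
      agentLoop.appendStableInput(alternative.transcript);
    }
    if (data.speech_final) {
      agentLoop.commitTurn(); // end of turn detected: trigger the LLM
    }
  }
});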

Three lines of logic, but it changes everything. We regularly see teams wait for the "final transcript" before kicking off the LLM, losing 500 ms per turn for no reason.
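The `agentLoop` object used in the snippet above is not shown in this article. A minimal sketch of what it could look like, assuming a hypothetical `onTurn` callback that starts the LLM request (the class and its internals are our illustration, not the production code):

```javascript
// Hypothetical agent loop: buffers stable STT segments and fires once per turn.
class AgentLoop {
  constructor(onTurn) {
    this.onTurn = onTurn; // assumed callback that starts the LLM request
    this.segments = [];
  }

  // Called for every stable segment; may be called several times per turn.
  appendStableInput(text) {
    const trimmed = text.trim();
    if (trimmed) this.segments.push(trimmed);
  }

  // Called when end-of-turn is detected; hands the full utterance to the LLM.
  commitTurn() {
    if (this.segments.length === 0) return; // silence only, nothing to answer
    const utterance = this.segments.join(" ");
    this.segments = [];
    this.onTurn(utterance);
  }
}

const turns = [];
const agentLoop = new AgentLoop((utterance) => turns.push(utterance));
agentLoop.appendStableInput("I'd like to book");
agentLoop.appendStableInput("a demo next week.");
agentLoop.commitTurn();
console.log(turns);
```

Because stable segments are buffered as they arrive, the full utterance is already assembled when the turn is committed, so the LLM call can start immediately.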

Second principle: TTS starts on the first token

When your LLM emits its first token, you send it straight to TTS. ElevenLabs streaming and Deepgram Aura both accept text incrementally and start returning audio before the full reply has been generated. The person on the line hears the first syllable of the reply while the LLM is still generating the end of the sentence. It feels a bit magic the first time you see it work.
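One way to wire this up is a thin chunker between the two streams that forwards text at punctuation boundaries, so the TTS never receives half a word. The sketch below is an assumption about how that glue might look: `fakeLlmTokens` stands in for a real LLM stream and `sendToTts` for a real TTS socket write; neither is a provider API.

```javascript
// Stand-in for a streaming LLM response (hypothetical, not a provider API).
async function* fakeLlmTokens() {
  for (const t of ["Sure", ",", " I", " can", " help", ".", " What", " day", "?"]) {
    yield t;
  }
}

// Forward text to TTS as soon as a clause or sentence boundary shows up,
// instead of waiting for the full completion.
async function pipeTokensToTts(tokens, sendToTts) {
  let buffer = "";
  for await (const token of tokens) {
    buffer += token;
    if (/[.,!?;:]$/.test(buffer)) {
      sendToTts(buffer);
      buffer = "";
    }
  }
  if (buffer) sendToTts(buffer); // flush whatever is left at end of stream
}

const sent = [];
const done = pipeTokensToTts(fakeLlmTokens(), (text) => sent.push(text));
done.then(() => console.log(sent));
```

The boundary rule is a trade-off: smaller fragments reach the caller sooner, larger ones give the voice more context for prosody.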

Third principle: end-of-turn detection, calibrated to the use case

The long pole, almost always, is how long your system waits before deciding the person is done talking. Defaults are tuned very conservatively — 700 to 900 ms of silence — to avoid false positives. Too conservative for a real conversation.

By scoping the use case, you can often drop to 250-400 ms without degrading UX. If you build an inbound lead qualifier, users respond in short sentences, so 300 ms is enough. For an appointment-booking assistant where the user mulls over slots, 400 ms. For tier-2 tech support, where the user structures complex questions, stay closer to 600 ms, which is still below the defaults.

The secret is to stop accepting the default and to tune per use case.
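A simple way to keep these values honest is to make them explicit configuration rather than a constant buried in the STT setup. A sketch, with hypothetical use-case keys and an assumed conservative fallback of 700 ms (the low end of the defaults mentioned above):

```javascript
// Silence thresholds (ms) from the examples above; the keys are hypothetical names.
const ENDPOINTING_MS = {
  leadQualification: 300,
  appointmentBooking: 400,
  tier2Support: 600,
};

// Fall back to a conservative value for use cases that have not been tuned yet.
function endpointingFor(useCase, fallbackMs = 700) {
  return ENDPOINTING_MS[useCase] ?? fallbackMs;
}

console.log(endpointingFor("appointmentBooking"));
console.log(endpointingFor("unknownFlow"));
```

The returned value can then be passed as the `endpointing` option when opening the STT connection.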

Fourth principle: colocation actually matters

Every 50 ms of network hop counts several times over, because it applies on each leg of the path caller audio → orchestrator → LLM → TTS → caller audio. If you run your orchestrator in Europe and your TTS in the US, you pay transatlantic latency twice: once to send text out, once to bring audio back.

The fix: deploy TTS in the same cloud region as your orchestrator, and use Twilio edge locations to minimize PSTN ↔ orchestrator distance. On our Europe deployments, that's 150 to 300 ms saved purely on topology. It's free, but you have to think about it.
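To reason about topology before deploying, it can help to model each leg of the response path as a cost you pay once per turn. The sketch below does that; the latency values are placeholders we picked for illustration, not measurements, so plug in your own.

```javascript
// One-way network latencies in ms. Placeholder values, not measurements.
const ONE_WAY_MS = { sameRegion: 5, transatlantic: 80 };

// Each leg of the response path is paid once per turn.
function networkCostMs(legs) {
  return legs.reduce((total, leg) => total + ONE_WAY_MS[leg], 0);
}

// Orchestrator in Europe, TTS in the US: text goes out, audio comes back.
const split = networkCostMs(["transatlantic", "transatlantic"]);
// Orchestrator and TTS in the same region.
const colocated = networkCostMs(["sameRegion", "sameRegion"]);

console.log(`saved per turn: ${split - colocated} ms`);
```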

What we measured on a pilot

On an inbound qualification voice assistant we deployed in March for a client, here's what we got after two weeks of tuning:

End-to-end median latency dropped from 1,920 ms to 620 ms. The p95 — the one that makes people hang up — went from 2,850 ms to 890 ms. Overlap rate (caller talking over the agent) fell from 18% to 4%. Abandonment in the first 30 seconds went from 23% to 9%. Post-call CSAT climbed from 6.2 to 8.4 out of 10.
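For readers who want to track the same two numbers, here is a minimal sketch of computing median and p95 from per-turn latency samples using the nearest-rank method. The sample array is made up for illustration; it is not the pilot's data.

```javascript
// Nearest-rank percentile over per-turn latency samples (ms).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up per-turn measurements, for illustration only.
const turnLatenciesMs = [540, 580, 600, 610, 620, 630, 650, 700, 780, 890];

console.log("p50:", percentile(turnLatenciesMs, 50));
console.log("p95:", percentile(turnLatenciesMs, 95));
```

Measuring per turn, from end of caller speech to first audio byte played back, keeps the metric aligned with what the caller actually perceives.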

Roughly 80% of the improvement comes from streaming and roughly 20% from colocation. Both shares are rounded; turn-detection tuning accounts for the small remainder.

The stack we recommend, mid-2026

For the media gateway, Twilio Media Streams remains the standard. Vonage and Plivo are alternatives, but Twilio has the most mature ecosystem. The new thing this year is ConversationRelay: it massively simplifies the flow, and native media latency sits under 500 ms.

For STT, Deepgram Nova-3 Flux by default. AssemblyAI Universal-Streaming if you chase absolute first-word latency (90 ms). Whisper-large-v3 self-hosted if you have strict GDPR constraints — you lose on latency (~200 ms extra), you gain on sovereignty.

For the LLM, we use Claude Haiku 4.5 or GPT-4o mini. The latency/quality/steerability trade-off is the best we have found. We keep Llama 3.3-70B self-hosted in the toolbox for cases where the client wants everything on their servers.

For TTS, ElevenLabs Flash v2.5 by default. Sub-100 ms first-byte, multilingual voices that sound great in French and Arabic. Deepgram Aura-2 if you want everything from one provider. Mimic 3 self-hosted if strict GDPR (but expect voice quality to drop to mid-tier).

For the orchestrator, Node.js + WebSocket gets the job done. Twilio's ecosystem is most mature in Node. Python + FastAPI also works, Go + gRPC if you want raw perf.

What it costs

The question we always get. To give an order of magnitude, per minute of conversation (50% user / 50% agent):

With a premium stack (Deepgram + ElevenLabs + Claude Haiku), you're around 5.5 cents per minute. With a standard stack (Whisper + OpenAI TTS + GPT-4o mini), you drop to 3 cents. With a fully sovereign EU self-hosted stack, you go under 1 cent per minute, excluding VPS amortization.

The figures are orders of magnitude (Softcery has a nice calculator), but the roughly 5× gap between the premium and self-hosted stacks is real.
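To turn the per-minute figures into a monthly budget, a few lines are enough. The sketch below uses the orders of magnitude above; the stack keys and the 10,000-minute volume are hypothetical, and the self-hosted figure is treated as its upper bound of 1 cent, servers excluded.

```javascript
// Per-minute costs in cents, from the orders of magnitude above.
// The self-hosted figure is an upper bound ("under 1 cent") and excludes servers.
const CENTS_PER_MINUTE = { premium: 5.5, standard: 3, selfHosted: 1 };

// Monthly spend, in whole currency units, for a given call volume.
function monthlyCost(stack, minutesPerMonth) {
  return (CENTS_PER_MINUTE[stack] * minutesPerMonth) / 100;
}

const minutes = 10000; // hypothetical monthly volume
for (const stack of Object.keys(CENTS_PER_MINUTE)) {
  console.log(stack, monthlyCost(stack, minutes));
}
```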

Four mistakes we've seen in production

We'll close with four traps that cost us time, so you can skip them.

Enabling "barge-in" without proper VAD. Barge-in lets the caller interrupt the agent by talking. Cool on paper, catastrophic in practice if your voice detection is sensitive to background noise. The agent stops at every cough, every keyboard tap, and the conversation goes erratic. Disable barge-in by default; enable only after validating your VAD on real noisy signal.
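One common guard, sketched here under our own assumptions, is to require a run of consecutive voiced frames before treating caller audio as an interruption. `isVoiced` is assumed to come from whatever VAD you use, and the 10-frame threshold (about 200 ms with 20 ms audio chunks) is an illustrative value, not a recommendation.

```javascript
// Only treat caller audio as an interruption after enough consecutive voiced frames.
// With 20 ms frames, 10 frames means about 200 ms of sustained speech (assumed value).
function createBargeInGate(minVoicedFrames = 10) {
  let streak = 0;
  return function onFrame(isVoiced) {
    streak = isVoiced ? streak + 1 : 0;
    return streak >= minVoicedFrames; // true means "stop the agent's TTS"
  };
}

const gate = createBargeInGate();
const cough = [true, true, true, false]; // short burst, then silence
const speech = new Array(12).fill(true); // sustained speech

const coughInterrupts = cough.map((frame) => gate(frame)).includes(true);
const speechInterrupts = speech.map((frame) => gate(frame)).includes(true);
console.log({ coughInterrupts, speechInterrupts });
```

The threshold still has to be validated on real noisy calls; the gate only filters out short bursts, not sustained background noise.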

Caching tools/list on the LLM side without cache-busting when your business stack evolves. The agent calls tools that no longer exist → embarrassing verbal fallback ("uh, hang on, let me try something else…"). Add an ETag, it's free.
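A minimal sketch of that ETag-style revalidation, assuming a hypothetical `fetchTools` function that receives the last known ETag and returns either `{ notModified: true }` or a fresh `{ etag, tools }`. How you expose the ETag depends on your tool server; nothing here is a specific protocol's API.

```javascript
// Cache for the tool list, revalidated with an ETag-style version string.
// fetchTools is hypothetical: it receives the last known ETag and returns
// { notModified: true } or { etag, tools }.
function createToolCache(fetchTools) {
  let cached = { etag: null, tools: [] };
  return async function getTools() {
    const res = await fetchTools(cached.etag);
    if (!res.notModified) cached = { etag: res.etag, tools: res.tools };
    return cached.tools;
  };
}

// Fake backend whose tool list changes between calls.
let version = { etag: "v1", tools: ["book_slot", "cancel_slot"] };
const fakeFetch = async (etag) =>
  etag === version.etag ? { notModified: true } : version;

const getTools = createToolCache(fakeFetch);
(async () => {
  console.log(await getTools()); // first call populates the cache
  version = { etag: "v2", tools: ["book_slot"] }; // a tool is removed upstream
  console.log(await getTools()); // cache picks up the new list
})();
```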

Picking a too-premium TTS voice for fast transactional use cases. Nobody wants to hear an audiobook voice when they call to confirm a dentist appointment. Match the voice to the situation. For short transactional calls, go neutral and fast. For emotional or concierge use cases, you can level up.

Forgetting "verbal ack" patterns ("okay, noting that", "one moment…") when the LLM takes over 800 ms to respond — typically on turns that trigger a tool call. Without ack, the user thinks the line dropped. Three lines of code, changes everything.
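The pattern can be sketched as a timer racing the LLM reply. `speak` stands in for whatever function pushes text to your TTS, and `llmReply` is any promise that resolves to the reply text; both are hypothetical names for this example.

```javascript
// Play a short acknowledgement if the LLM has not answered within the budget.
async function respondWithAck(llmReply, speak, ackAfterMs = 800) {
  const timer = setTimeout(() => speak("One moment…"), ackAfterMs);
  try {
    const reply = await llmReply;
    speak(reply);
  } finally {
    clearTimeout(timer); // fast replies cancel the ack before it plays
  }
}

// Demo with a fake slow reply (1,000 ms), e.g. a turn that triggers a tool call.
const spoken = [];
const slowReply = new Promise((resolve) =>
  setTimeout(() => resolve("Your appointment is confirmed."), 1000)
);
respondWithAck(slowReply, (text) => spoken.push(text)).then(() =>
  console.log(spoken)
);
```

In the demo the acknowledgement plays at 800 ms and the real answer follows at 1,000 ms; a reply that arrives before the threshold suppresses the acknowledgement entirely.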