dev-tools · 4 min read

WebRTC Is the Wrong Tool for Voice AI

WebRTC drops audio packets to keep latency low. For voice AI, that means corrupted prompts and degraded model responses.

WebRTC was built for video calls. It makes deliberate tradeoffs that make sense when two humans are talking: drop packets aggressively, keep latency low, never buffer. The conversation stays snappy even if a word gets clipped. You can fill in the gap from context.

Voice AI breaks that assumption completely. If WebRTC drops part of your prompt to an LLM, the model never recovers that context. It guesses. It hallucinates. A 200ms wait for a clean prompt is a far better outcome than a garbled one arriving on time, but WebRTC doesn't give you that choice. The protocol has already decided for you.

This isn't a configuration issue. It's architectural. A detailed technical post on moq.dev from an engineer who has built WebRTC infrastructure at both Twitch and Discord lays out exactly why.

How WebRTC Handles Bad Networks

WebRTC's jitter buffer for audio is sized between 20ms and 200ms. It renders audio based on arrival time, not timestamps. When the network gets congested, the protocol drops packets rather than buffer and retransmit. That behavior is hard-coded into browser implementations. You cannot retransmit a lost audio packet from within a browser. The spec doesn't allow it.

For a phone call, this is fine. The human on the other end fills in the gap. For a voice AI session, the model receives incomplete audio and works with whatever it gets.
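From inside a browser, the most you can do is watch this happen. Here is a minimal sketch using the standard WebRTC stats API (assuming `pc` is an already-established RTCPeerConnection) that surfaces the loss and jitter-buffer counters; there is no corresponding knob to request a retransmit.

```typescript
// Observe audio loss and jitter-buffer behavior on an existing RTCPeerConnection.
// You can only watch packetsLost climb; the browser exposes no way to get a
// dropped audio packet retransmitted.
async function logAudioStats(pc: RTCPeerConnection): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report) => {
    if (report.type === "inbound-rtp" && report.kind === "audio") {
      console.log({
        packetsReceived: report.packetsReceived,
        packetsLost: report.packetsLost,             // gone for good
        jitter: report.jitter,                       // seconds
        jitterBufferDelay: report.jitterBufferDelay, // cumulative, seconds
      });
    }
  });
}
```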

The problem compounds with how OpenAI has to work around WebRTC's rendering behavior. Because WebRTC renders based on arrival time rather than timestamps, OpenAI has to introduce an artificial sleep before every audio packet just to ensure packets arrive when they're supposed to be played back. Then, when network congestion hits, those same packets get dropped to keep latency low. The artificial delay and the aggressive dropping work directly against each other.

OpenAI is literally introducing artificial latency, and then aggressively dropping packets to keep latency low.
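To make that pacing concrete, here is a rough sketch of what sender-side pacing looks like in principle. The 20ms frame size and the sendPacket callback are illustrative assumptions, not OpenAI's actual implementation.

```typescript
// Hypothetical sketch of sender-side pacing: hold each audio frame until its
// scheduled playback time, because the receiver renders on arrival time.
// FRAME_MS and sendPacket are illustrative, not OpenAI's actual code.
const FRAME_MS = 20; // one audio frame

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function sendPaced(
  frames: Uint8Array[],
  sendPacket: (frame: Uint8Array) => void
): Promise<void> {
  const start = Date.now();
  for (let i = 0; i < frames.length; i++) {
    // Wait until this frame's playback deadline before putting it on the wire.
    const deadline = start + i * FRAME_MS;
    const wait = deadline - Date.now();
    if (wait > 0) await sleep(wait);
    sendPacket(frames[i]); // under congestion, WebRTC may still drop it anyway
  }
}
```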

The Connection Setup Tax

Before you even get to the audio quality problem, WebRTC makes you pay a significant connection setup cost. Establishing a WebRTC session requires a minimum of 8 round trips: a TCP handshake, a TLS handshake, and an HTTP request to reach the signaling server (3 RTTs), then ICE negotiation with the media server (1 RTT), then DTLS (2 RTTs), then SCTP (2 RTTs). That's before a single byte of audio moves.

QUIC collapses all of that into 1 RTT. The protocol was designed for exactly this kind of stateful, latency-sensitive connection. It's not even close.

WebRTC also drags along roughly 45 RFCs, many dating back to the early 2000s. That's a lot of surface area to implement correctly, and as the post notes, basically nobody does. Discord has forked WebRTC so extensively that its native clients only implement a small fraction of the actual protocol.

The Port Allocation Mess

WebRTC's spec says each connection gets its own ephemeral port on the server. This creates two immediate problems at scale: servers have a finite number of ports, and firewalls routinely block ephemeral port ranges. The practical result is that nobody follows the spec.

Twitch ran its WebRTC infrastructure on UDP:443 to slip past firewalls. Discord uses UDP:50000-50032, one port per CPU core. Both are workarounds that require custom infrastructure and custom load balancing logic sitting in front of everything.

Load balancing WebRTC at scale requires the balancer to maintain state, specifically a mapping from each client's source IP and port to the backend server handling that connection. OpenAI does this with a Redis instance storing that mapping. It works until a client's IP or port changes, which happens constantly: WiFi to cellular handoffs, NAT rebinding, mobile networks doing whatever mobile networks do. When the source IP/port changes, the cached mapping is stale and the connection breaks. There's no recovery path within the WebRTC model because the protocol has no way to identify a connection independent of source address.
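A stripped-down sketch of that stateful pattern makes the failure mode obvious. The Redis lookup is reduced to an in-memory map here, and all names are illustrative: the routing key is the source address itself, so any address change is a cache miss.

```typescript
// Illustrative sketch of source-address-keyed routing (a stand-in for the
// Redis mapping described above). Names and structure are hypothetical.
const routes = new Map<string, string>(); // "ip:port" -> backend server id

function register(clientIp: string, clientPort: number, backend: string): void {
  routes.set(`${clientIp}:${clientPort}`, backend);
}

function route(clientIp: string, clientPort: number): string | undefined {
  // If the client hopped from WiFi to cellular (new IP) or its NAT rebound
  // (new port), this lookup misses and the session is effectively orphaned.
  return routes.get(`${clientIp}:${clientPort}`);
}
```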

What QUIC Does Differently

QUIC solves the load balancing problem at the protocol level. Every QUIC packet carries a CONNECTION_ID chosen by the receiver, ranging from 0 to 20 bytes. The connection is identified by that ID, not by source IP and port. When a client's network changes mid-session, the CONNECTION_ID in each packet lets the server recognize the connection regardless.

This makes stateless load balancing possible. QUIC-LB lets load balancers decode the connection ID and forward packets to the right backend without maintaining any routing table or holding encryption keys. The load balancer doesn't need to know anything about the session.
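A simplified sketch of the idea: the backend embeds its own identity in the connection IDs it hands out, and the balancer recovers the route from the packet alone. The one-byte encoding below is invented for illustration; the actual QUIC-LB draft defines several encodings, including encrypted ones.

```typescript
// Simplified illustration of QUIC-LB-style stateless routing. The "server id
// in the first byte of the connection ID" scheme is invented for clarity.
function issueConnectionId(serverId: number): Uint8Array {
  const cid = new Uint8Array(8);
  cid[0] = serverId;                       // routable part, chosen by the backend
  crypto.getRandomValues(cid.subarray(1)); // rest is opaque
  return cid;
}

function routePacket(destinationCid: Uint8Array): number {
  // The balancer reads the server id straight out of the packet header:
  // no routing table, no session state, no source-address dependency.
  return destinationCid[0];
}
```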

QUIC also has a preferred_address feature that enables a clean anycast/unicast load balancing pattern. All servers can advertise the same anycast address for the initial handshake, then redirect clients to specific unicast addresses for the stateful part of the connection. The load balancer never needs global state. The protocol handles it.

The Path Forward

The post suggests two practical steps. In the near term, streaming audio over WebSockets is a reasonable interim approach. It uses existing TCP/HTTP infrastructure, works with standard Kubernetes deployments, and doesn't require custom load balancing. TCP's head-of-line blocking, usually considered a liability, is actually desirable here: you want packets to arrive in order and complete, even if it means waiting a bit longer.
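As a rough sketch of that interim approach (the wss:// endpoint and the raw-chunk framing are assumptions, not any specific vendor's API), a browser client can capture microphone audio and push Opus chunks over a plain WebSocket, letting TCP handle ordering and retransmission:

```typescript
// Interim approach: microphone audio over a WebSocket. TCP guarantees ordered,
// complete delivery; the tradeoff is occasional extra latency instead of loss.
// The wss:// URL and the raw-chunk framing are illustrative assumptions.
async function streamMicOverWebSocket(): Promise<void> {
  const ws = new WebSocket("wss://voice.example.com/session");
  ws.binaryType = "arraybuffer";
  await new Promise<void>((resolve) => (ws.onopen = () => resolve()));

  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(mic, { mimeType: "audio/webm;codecs=opus" });

  recorder.ondataavailable = async (event) => {
    if (event.data.size > 0 && ws.readyState === WebSocket.OPEN) {
      ws.send(await event.data.arrayBuffer()); // arrives in order, or not at all
    }
  };
  recorder.start(20); // emit a chunk roughly every 20 ms
}
```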

Longer term, WebTransport and Media over QUIC (MoQ) are the real destination. WebTransport brings QUIC's connection model to the browser with modern protocol semantics. MoQ's cache and fanout features aren't critical for 1:1 voice AI sessions, but the underlying transport properties are exactly what the use case needs.
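For a sense of what that looks like in the browser today, here is a minimal WebTransport sketch, with a single reliable stream standing in for the audio channel. The endpoint URL is an illustrative assumption.

```typescript
// Minimal WebTransport sketch: one QUIC connection, one reliable outgoing
// stream for audio frames. The endpoint URL is an illustrative assumption.
async function streamOverWebTransport(frames: AsyncIterable<Uint8Array>): Promise<void> {
  const transport = new WebTransport("https://voice.example.com/webtransport");
  await transport.ready; // single-round-trip QUIC handshake

  const stream = await transport.createUnidirectionalStream();
  const writer = stream.getWriter();

  for await (const frame of frames) {
    await writer.write(frame); // ordered and reliable within this stream
  }
  await writer.close();
}
```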

Bottom Line

WebRTC isn't broken. It does exactly what it was designed to do: keep two humans talking with minimal latency, even on a bad connection. That design is just wrong for voice AI, where a dropped audio fragment corrupts the model's context and a slightly longer wait is always the better tradeoff. If you're building voice AI infrastructure today, you're either fighting WebRTC's defaults or stacking workarounds on top of workarounds. QUIC is the right foundation. The tooling is maturing and browser support is there. The question is when the ecosystem catches up.
