
WebRTC with Python: build an AI voice agent

The most interesting WebRTC peer you can write in Python isn't in a browser — it's an AI voice agent: a headless process that joins a call as a real participant, listens to the caller's audio, runs it through speech-to-text → an LLM → text-to-speech, and speaks the reply back into the room.

This guide builds exactly that with the free, open-source metered-realtime Python SDK. It handles the parts aiortc leaves to you — signalling, joining a room, multi-peer presence, auto-reconnect, and TURN — so the tutorial can focus on the agent logic. A Python agent and a browser caller share the same room because both SDKs speak the same wire protocol.

What you'll build

browser caller ──audio──▶ Python agent ──▶ STT ──▶ LLM ──▶ TTS ──┐
      ▲                                                          │
      └───────────────── agent's voice (audio) ◀─────────────────┘

A long-running Python process that joins agent-room, transcribes whatever the caller says, asks an LLM for a reply, and streams the synthesized answer back into the room as its microphone audio.

Prerequisites

  • Python 3.10+.
  • A free Metered key — sign up, then create a Realtime Messaging → Publishable key and enable Send (it's off by default; without it the WebRTC media never negotiates). Full steps on the Python SDK page.
  • Install the SDK with the WebRTC extra (pulls in aiortc):
pip install "metered-realtime[webrtc]"
  • An STT, an LLM, and a TTS — any provider works (Whisper/Deepgram, your LLM of choice, any TTS that streams PCM). We stub these so you can drop in yours.

Step 1 — Join a room

MeteredPeer is an async context manager. Open it, join() a channel, and run forever:

import asyncio
import os
from metered_realtime import MeteredPeer, PeerJoined

API_KEY = os.environ["METERED_KEY"]  # your pk_live_... publishable key
ROOM = "agent-room"

async def main():
    async with MeteredPeer(api_key=API_KEY) as peer:
        @peer.on(PeerJoined)
        def on_peer(ev):
            print(f"caller {ev.peer.id} joined")

        await peer.join(ROOM)   # presence + discovery; TURN auto-applied
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
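
The await asyncio.Future() line keeps the process alive until it is cancelled. For a service that should exit cleanly on SIGTERM (a container stop, for instance), one possible variation is to wait on an asyncio.Event that a signal handler sets. This is a minimal sketch using only the standard library; the helper name run_until_signalled is hypothetical and not part of the SDK.

```python
import asyncio
import signal
from typing import Optional

async def run_until_signalled(stop: Optional[asyncio.Event] = None) -> None:
    """Block until SIGINT/SIGTERM arrives, or until other code sets `stop`."""
    if stop is None:
        stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            loop.add_signal_handler(sig, stop.set)
        except (NotImplementedError, RuntimeError):
            # Not available on Windows event loops or outside the main thread;
            # in that case only an explicit stop.set() ends the wait.
            pass
    await stop.wait()
```

In main(), await run_until_signalled() would take the place of await asyncio.Future(), so the async with block exits normally instead of being torn down by a cancellation.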

Step 2 — Listen to the caller's audio

When a caller joins, subscribe to their Track and stream its frames with iter_frames. Each frame is an av.AudioFrame (raw PCM you can hand to STT):

from metered_realtime import Track, iter_frames

@peer.on(PeerJoined)
def on_peer(ev):
    @ev.peer.on(Track)
    async def on_track(t):
        if t.track.kind != "audio":
            return
        async for frame in iter_frames(t.track):
            ...  # feed `frame` to speech-to-text (next step)

iter_frames ends cleanly when the caller leaves or the track stops — no manual teardown.
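
Streaming STT services often prefer audio in fixed-size chunks rather than whatever size each frame happens to carry. The sketch below regroups an async stream of PCM bytes into chunks of a chosen size. It assumes each av.AudioFrame has already been turned into raw bytes (for packed s16 audio, PyAV's frame.to_ndarray().tobytes() is one way, but that depends on your sample format). The name rechunk is hypothetical, not an SDK function.

```python
from typing import AsyncIterator

async def rechunk(chunks: AsyncIterator[bytes], size: int) -> AsyncIterator[bytes]:
    """Regroup an async stream of byte chunks into pieces of exactly `size`
    bytes; the final piece may be shorter."""
    buf = bytearray()
    async for chunk in chunks:
        buf.extend(chunk)
        # Emit as many full-size pieces as the buffer currently holds.
        while len(buf) >= size:
            yield bytes(buf[:size])
            del buf[:size]
    if buf:
        # Flush whatever is left when the source ends (caller left, track stopped).
        yield bytes(buf)
```

Because the generator finishes when its source finishes, it inherits the clean shutdown behaviour described above: when the frame stream ends, the remainder is flushed and the loop consuming it exits.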

Step 3 — The STT → LLM → TTS pipeline

Wire the frames into a streaming STT, send each transcript to your LLM, and turn the reply into PCM with a TTS. These three are stubs — swap in your providers:

async def stt_stream(frames):
    """Yield transcripts as the caller speaks. Plug in Whisper / Deepgram / etc.
    Most streaming STT APIs accept an async iterator of PCM and yield text."""
    async for _frame in frames:
        ...
        yield "transcribed text"

async def llm(text: str) -> str:
    """Your LLM. Return the agent's reply for the caller's utterance."""
    return f"You said: {text}"

async def tts(text: str):
    """Yield PCM chunks (bytes) for `text`. Plug in your TTS; match the
    sample rate to the AudioSource below (24 kHz here)."""
    yield b""  # 24 kHz mono s16 PCM
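
The tts stub above yields an empty chunk, so nothing is audible until a real provider is wired in. As a stand-in while testing, a tone generator can produce PCM in the format named in the stub's comment (24 kHz, mono, signed 16-bit little-endian). This is only a sketch: tone_tts and its parameters are hypothetical, and the beep length simply scales with the length of the text.

```python
import math
import struct

SAMPLE_RATE = 24_000  # must match the rate the audio source expects

async def tone_tts(text: str, freq: float = 440.0, chunk_ms: int = 20):
    """Stand-in TTS: yield a sine beep as 24 kHz mono s16le PCM chunks.
    Duration is 50 ms per character of `text`, capped at 2 seconds."""
    duration_ms = min(2000, 50 * max(1, len(text)))
    total = SAMPLE_RATE * duration_ms // 1000      # total samples to emit
    per_chunk = SAMPLE_RATE * chunk_ms // 1000     # samples per chunk
    for start in range(0, total, per_chunk):
        n = min(per_chunk, total - start)
        samples = (
            int(0.3 * 32767 * math.sin(2 * math.pi * freq * (start + i) / SAMPLE_RATE))
            for i in range(n)
        )
        # "<h" = little-endian signed 16-bit, one value per sample
        yield struct.pack(f"<{n}h", *samples)
```

With the defaults each full chunk holds 480 samples (960 bytes), which is 20 ms of audio at 24 kHz.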

Step 4 — Speak the reply back

Give the agent a voice: one AudioSource added to the peer as a track. Push your TTS PCM into it and every peer in the room hears it — exactly like a microphone:

from metered_realtime import AudioSource

voice = AudioSource(input_rate=24_000) # match your TTS sample rate
peer.add_track(voice) # the agent's microphone

# inside on_track, after STT/LLM:
async for transcript in stt_stream(iter_frames(t.track)):
    reply = await llm(transcript)
    async for pcm in tts(reply):
        await voice.push(pcm)  # the agent speaks into the room
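
In the loop above, the next transcript is not read until the current reply has been fully synthesized and pushed. If that coupling becomes a problem, one option is to hand replies to a separate speaker task through an asyncio.Queue. The sketch below is an assumption about how you might structure this, not something the SDK prescribes: speaker is a hypothetical helper, tts is any async generator of PCM bytes, and push stands in for a coroutine such as voice.push.

```python
import asyncio
from typing import Awaitable, Callable

async def speaker(
    queue: asyncio.Queue,
    tts,
    push: Callable[[bytes], Awaitable[None]],
) -> None:
    """Drain replies from `queue`, synthesize each one, and push the PCM.
    A None item is the signal to stop."""
    while True:
        reply = await queue.get()
        if reply is None:
            break
        async for pcm in tts(reply):
            await push(pcm)
```

The listening loop would then call await queue.put(reply) instead of running TTS inline, and put None on the queue when the caller's track ends so the speaker task can finish.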

Putting it together

import asyncio
import os
from metered_realtime import MeteredPeer, PeerJoined, Track, AudioSource, iter_frames

API_KEY = os.environ["METERED_KEY"]
ROOM = "agent-room"

# --- plug in your providers ---
async def stt_stream(frames):
    async for _frame in frames:
        yield "transcribed text"

async def llm(text: str) -> str:
    return f"You said: {text}"

async def tts(text: str):
    yield b""  # 24 kHz mono s16 PCM
# ------------------------------

async def main():
    async with MeteredPeer(api_key=API_KEY) as peer:
        voice = AudioSource(input_rate=24_000)  # the agent's voice
        peer.add_track(voice)

        @peer.on(PeerJoined)
        def on_peer(ev):
            print(f"caller {ev.peer.id} joined")

            @ev.peer.on(Track)
            async def on_track(t):
                if t.track.kind != "audio":
                    return
                async for transcript in stt_stream(iter_frames(t.track)):
                    reply = await llm(transcript)
                    async for pcm in tts(reply):
                        await voice.push(pcm)

        await peer.join(ROOM)
        await asyncio.Future()

asyncio.run(main())

Running it

export METERED_KEY=pk_live_...
python agent.py

The agent connects, joins agent-room, and waits. Now join the same room from a browser with the JavaScript SDK (new MeteredPeer({ apiKey }), then peer.join("agent-room"), then peer.addStream(mic)) and speak: the agent transcribes you, asks the LLM, and talks back. No gateway, no media server: the Python peer and the browser peer connect directly over WebRTC, and traffic is relayed through Open Relay TURN only when a direct path isn't possible.

Running inside a web app

metered-realtime is asyncio-native and doesn't own the event loop, so it drops into your existing server:

  • FastAPI / Starlette: start the agent in a lifespan handler (or per-call from an endpoint) and let it run alongside your routes.
  • Django (ASGI): run it from an async management command or a Channels worker.
  • Plain workers: asyncio.run(main()) in a container or systemd service; auto-reconnect keeps it alive across network blips.
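
For the lifespan option, the usual shape is an async context manager that starts the agent as a background task on startup and cancels it on shutdown. The sketch below uses only the standard library so it runs on its own; the main() here is a placeholder for the agent's main() from this guide, and wiring it into a framework (for example FastAPI(lifespan=lifespan)) is left to your app.

```python
import asyncio
import contextlib

async def main() -> None:
    """Placeholder for the agent's main(); runs until cancelled."""
    await asyncio.Future()

@contextlib.asynccontextmanager
async def lifespan(app):
    # Startup: run the agent alongside the web server on the same event loop.
    task = asyncio.create_task(main())
    try:
        yield
    finally:
        # Shutdown: cancel the agent and wait for it to unwind.
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

Cancelling the task raises CancelledError inside main(), which lets any async with block in the agent exit on its way out.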

Where to go next

The whole stack is free to start: the SDK is MIT and open source, signalling is free up to 100 connections / 100k messages a month, and TURN is free up to 20 GB/month, with no credit card required.