WebRTC with Python: build an AI voice agent
The most interesting WebRTC peer you can write in Python isn't in a browser — it's an AI voice agent: a headless process that joins a call as a real participant, listens to the caller's audio, runs it through speech-to-text → an LLM → text-to-speech, and speaks the reply back into the room.
This guide builds exactly that with the free, open-source metered-realtime Python SDK. It handles the parts aiortc leaves to you — signalling, joining a room, multi-peer presence, auto-reconnect, and TURN — so the tutorial can focus on the agent logic. A Python agent and a browser caller share the same room because both SDKs speak the same wire protocol.
What you'll build
browser caller ──audio──▶ Python agent ──▶ STT ──▶ LLM ──▶ TTS ──┐
     ▲                                                           │
     └──────────────────── agent's voice (audio) ◀───────────────┘
A long-running Python process that joins agent-room, transcribes whatever the caller says, asks an LLM for a reply, and streams the synthesized answer back into the room as its own audio track, just as a microphone would.
Prerequisites
- Python 3.10+.
- A free Metered key: sign up, then create a Realtime Messaging → Publishable key and enable Send (it's off by default; without it the WebRTC media never negotiates). Full steps are on the Python SDK page.
- Install the SDK with the WebRTC extra (pulls in aiortc):
pip install "metered-realtime[webrtc]"
- An STT, an LLM, and a TTS — any provider works (Whisper/Deepgram, your LLM of choice, any TTS that streams PCM). We stub these so you can drop in yours.
Step 1 — Join a room
MeteredPeer is an async context manager. Open it, join() a channel, and run forever:
import asyncio
import os

from metered_realtime import MeteredPeer, PeerJoined

API_KEY = os.environ["METERED_KEY"]  # your pk_live_... publishable key
ROOM = "agent-room"

async def main():
    async with MeteredPeer(api_key=API_KEY) as peer:
        @peer.on(PeerJoined)
        def on_peer(ev):
            print(f"caller {ev.peer.id} joined")

        await peer.join(ROOM)    # presence + discovery; TURN auto-applied
        await asyncio.Future()   # run until cancelled

asyncio.run(main())
Step 2 — Listen to the caller's audio
When a caller joins, subscribe to their Track and stream its frames with iter_frames. Each frame is an av.AudioFrame (raw PCM you can hand to STT):
from metered_realtime import Track, iter_frames

# inside main(), within the `async with MeteredPeer(...)` block:
@peer.on(PeerJoined)
def on_peer(ev):
    @ev.peer.on(Track)
    async def on_track(t):
        if t.track.kind != "audio":
            return
        async for frame in iter_frames(t.track):
            ...  # feed `frame` to speech-to-text (next step)
iter_frames ends cleanly when the caller leaves or the track stops — no manual teardown.
Step 3 — The STT → LLM → TTS pipeline
Wire the frames into a streaming STT, send each transcript to your LLM, and turn the reply into PCM with a TTS. These three are stubs — swap in your providers:
async def stt_stream(frames):
    """Yield transcripts as the caller speaks. Plug in Whisper / Deepgram / etc.
    Most streaming STT APIs accept an async iterator of PCM and yield text."""
    async for _frame in frames:
        ...
        yield "transcribed text"

async def llm(text: str) -> str:
    """Your LLM. Return the agent's reply for the caller's utterance."""
    return f"You said: {text}"

async def tts(text: str):
    """Yield PCM chunks (bytes) for `text`. Plug in your TTS; match the
    sample rate to the AudioSource below (24 kHz here)."""
    yield b""  # 24 kHz mono s16 PCM
Step 4 — Speak the reply back
Give the agent a voice: one AudioSource added to the peer as a track. Push your TTS PCM into it and every peer in the room hears it — exactly like a microphone:
from metered_realtime import AudioSource

voice = AudioSource(input_rate=24_000)   # match your TTS sample rate
peer.add_track(voice)                    # the agent's microphone

# inside on_track, after STT/LLM:
async for transcript in stt_stream(iter_frames(t.track)):
    reply = await llm(transcript)
    async for pcm in tts(reply):
        await voice.push(pcm)            # the agent speaks into the room
Putting it together
import asyncio
import os

from metered_realtime import MeteredPeer, PeerJoined, Track, AudioSource, iter_frames

API_KEY = os.environ["METERED_KEY"]
ROOM = "agent-room"

# --- plug in your providers ---
async def stt_stream(frames):
    async for _frame in frames:
        yield "transcribed text"

async def llm(text: str) -> str:
    return f"You said: {text}"

async def tts(text: str):
    yield b""  # 24 kHz mono s16 PCM
# ------------------------------

async def main():
    async with MeteredPeer(api_key=API_KEY) as peer:
        voice = AudioSource(input_rate=24_000)  # the agent's voice
        peer.add_track(voice)

        @peer.on(PeerJoined)
        def on_peer(ev):
            print(f"caller {ev.peer.id} joined")

            @ev.peer.on(Track)
            async def on_track(t):
                if t.track.kind != "audio":
                    return
                async for transcript in stt_stream(iter_frames(t.track)):
                    reply = await llm(transcript)
                    async for pcm in tts(reply):
                        await voice.push(pcm)

        await peer.join(ROOM)
        await asyncio.Future()

asyncio.run(main())
Running it
export METERED_KEY=pk_live_...
python agent.py
The agent connects, joins agent-room, and waits. Now join the same room from a browser with the JavaScript SDK (new MeteredPeer({ apiKey }) → peer.join("agent-room") → peer.addStream(mic)), and speak — the agent transcribes you, asks the LLM, and talks back. No gateway, no media server: the Python peer and the browser peer are connected directly over WebRTC, relayed by Open Relay TURN only when a direct path isn't possible.
Running inside a web app
metered-realtime is asyncio-native and doesn't own the event loop, so it drops into your existing server:
- FastAPI / Starlette: start the agent in a lifespan handler (or per-call from an endpoint) and let it run alongside your routes.
- Django (ASGI): run it from an async management command or a Channels worker.
- Plain workers: asyncio.run(main()) in a container or systemd service; auto-reconnect keeps it alive across network blips.
Where to go next
- Python WebRTC SDK reference — the full API, comparison with raw aiortc, and FAQ.
- Full SDK docs: MeteredPeer, media input (AudioSource, from_file, from_rtsp), data channels, errors.
- Free signalling server + Open Relay TURN: the free stack the agent runs on.
The whole stack is free to start: the SDK is MIT-licensed and open source, signalling is free up to 100 connections / 100k messages a month, and TURN is free up to 20 GB/month, with no credit card required.