Example — Audio Agent
The "AI agent in the call" shape, in one process: an agent peer streams generated audio into a room via an AudioSource, and a listener peer consumes that track frame-by-frame via iter_frames. Real WebRTC, two MeteredPeers, no browser.
PyPI: metered-realtime
This is the runnable examples/audio_agent.py from the SDK, walked through end to end.
What it demonstrates
- Pushing synthesized audio into a room with AudioSource, where a voice agent's text-to-speech output goes
- Consuming a peer's audio track with iter_frames, where your speech-to-text reads
- The 16 kHz mono → 48 kHz stereo resample and real-time pacing that AudioSource.push does for you
- An async Track handler, which the SDK schedules and tracks so it can run a whole consume loop
Install
pip install "metered-realtime[webrtc]"
Media needs the webrtc extra — it pulls in the WebRTC backend (aiortc + av). There's no getUserMedia server-side, so you supply the track.
Running it locally
The example reads your publishable key from METERED_KEY:
METERED_KEY=pk_live_… python examples/audio_agent.py
To point at a local or self-hosted server, set the optional METERED_URL:
METERED_KEY=pk_live_… METERED_URL=ws://localhost:9292 python examples/audio_agent.py
It prints OK — listener received N audio frames from the agent and exits 0 once media is flowing over WebRTC.
Source walkthrough
Setup — two peers, one process
Both peers read the same options. METERED_KEY is required; METERED_URL is an optional server override.
import asyncio
import math
import struct
import os
import sys
from metered_realtime import AudioSource, MediaStream, MeteredPeer, PeerJoined, Track, iter_frames
CHANNEL = "metered-realtime-audio-agent"
def _opts() -> dict[str, str]:
    key = os.environ.get("METERED_KEY")
    if not key:
        print("Set METERED_KEY=pk_live_... (a publishable key).", file=sys.stderr)
        raise SystemExit(2)
    opts = {"api_key": key}
    url = os.environ.get("METERED_URL")
    if url:
        opts["url"] = url
    return opts
A stand-in for text-to-speech
The example generates a sine tone as raw signed-16-bit mono PCM, standing in for whatever audio your real TTS produces:
def sine_pcm(freq: int = 440, seconds: float = 2.0, rate: int = 16_000) -> bytes:
    """A simple s16-mono tone, a stand-in for whatever your TTS produces."""
    return b"".join(
        struct.pack("<h", int(12_000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(int(rate * seconds))
    )
In a real agent this is your speech synthesizer emitting PCM chunks.
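A synthesizer usually hands over audio in small pieces rather than one buffer. As a rough sketch of that shape, the helper below slices mono s16 PCM into fixed-duration chunks. The name `pcm_chunks` and the 20 ms default are assumptions made for illustration; neither comes from the SDK.

```python
from collections.abc import Iterator

BYTES_PER_SAMPLE = 2  # signed 16-bit PCM


def pcm_chunks(pcm: bytes, rate: int = 16_000, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive chunk_ms slices of mono s16 PCM (the last may be shorter)."""
    # Hypothetical helper, not an SDK function.
    step = rate * chunk_ms // 1000 * BYTES_PER_SAMPLE  # 320 samples -> 640 bytes
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


# Two seconds of silence at 16 kHz mono: 32,000 samples, 64,000 bytes.
silence = bytes(2 * 16_000 * BYTES_PER_SAMPLE)
chunks = list(pcm_chunks(silence))
print(len(chunks), len(chunks[0]))  # 100 640
```

Feeding `sine_pcm(...)` through a helper like this gives the example the same chunk-at-a-time rhythm a streaming TTS would have.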
The agent — stream synthesized audio into the room
Construct an AudioSource for your input format (here 16 kHz mono PCM), then attach it as a track grouped under a named MediaStream. Every peer in the room — present or future — receives it:
source = AudioSource(input_rate=16_000) # 16 kHz mono PCM in
agent.add_track(source, MediaStream(id="agent-voice"))
AudioSource is itself a track, so add_track sends it like any other. The stream id "agent-voice" is the stable name receivers see — it's how a listener correlates "the agent's voice" across the session and across reconnects.
The listener — consume the agent's track (your STT feed)
When the agent's media arrives, the SDK fires a Track event on that remote peer. The handler iterates the track with iter_frames, which yields one av.AudioFrame per 20 ms and exits cleanly when the track stops:
frames_in = [0]
@listener.on(PeerJoined)
def _(ev: PeerJoined) -> None:
    # An async Track handler is scheduled + tracked by the SDK, so we can consume
    # the whole stream here (this loop is where speech-to-text would read).
    @ev.peer.on(Track)
    async def _(t: Track) -> None:
        async for _frame in iter_frames(t.track):  # av.AudioFrame per 20 ms
            frames_in[0] += 1
The handler is async and runs a loop for the lifetime of the track. The SDK schedules and tracks it, so you don't spawn the task yourself. This loop is exactly where your speech-to-text reads: convert each av.AudioFrame to PCM and feed it to your STT.
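Many speech-to-text engines expect mono input, so part of "convert each frame to PCM" is often a downmix. The sketch below assumes you have already extracted interleaved little-endian s16 stereo bytes from the frame (for instance via PyAV's `to_ndarray()`, not shown) and averages each left/right pair. The stereo s16 layout is an assumption about what arrives, and `stereo_to_mono_s16` is a hypothetical helper, not an SDK call. Sample-rate conversion is left out.

```python
import struct


def stereo_to_mono_s16(interleaved: bytes) -> bytes:
    """Average L/R pairs of interleaved little-endian s16 stereo into mono s16."""
    count = len(interleaved) // 2  # number of 16-bit samples across both channels
    samples = struct.unpack(f"<{count}h", interleaved[: count * 2])
    # Step through (left, right) pairs; a dangling unpaired sample is dropped.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count - 1, 2)]
    return struct.pack(f"<{len(mono)}h", *mono)


stereo = struct.pack("<4h", 100, 300, -200, -400)  # two stereo sample pairs
mono = stereo_to_mono_s16(stereo)
print(struct.unpack("<2h", mono))  # (200, -300)
```

If your STT wants a lower sample rate as well, a proper resampler is a better fit than dropping samples, which would alias.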
Driving it — join order matters
try:
    # Listener joins first; the agent (joining second) is the impolite peer and so
    # drives the offer carrying its media immediately, with no polite-defer wait.
    await listener.join(CHANNEL)
    await agent.join(CHANNEL)
    # Stream ~2 s of speech. push() resamples to 48 kHz and paces it out in real
    # time; for a live agent you'd push each TTS chunk as it's produced.
    await source.push(sine_pcm(seconds=2.0))
    source.end()
    await _wait(lambda: frames_in[0] > 0, 30)  # wait for media to flow over WebRTC
    await asyncio.sleep(1.0)  # let a bit more audio arrive
    print(f"OK — listener received {frames_in[0]} audio frames from the agent")
    return 0
except TimeoutError:
    print("FAILED — no audio frames received within 30s", file=sys.stderr)
    return 1
finally:
    await agent.close()
    await listener.close()
A few things are load-bearing here:
- Join order. The listener joins first; the agent joins second. The second peer to join is the impolite peer, so it drives the offer carrying its media right away — there's no waiting on a polite-side renegotiation defer.
- await source.push(...) queues audio for emission. It resamples your 16 kHz mono input to 48 kHz stereo internally and paces it out in real time, 20 ms at a time. It's also backpressure-aware: a faster-than-real-time producer suspends once the buffer fills rather than growing memory. For a live agent you push each TTS chunk as it's produced, not one big buffer.
- source.end() signals end-of-stream: the track emits any remaining buffered audio, then stops. Call it when your producer is done so the track terminates cleanly instead of emitting silence forever.
- await asyncio.sleep(1.0) lets the tail of the audio finish arriving over the network before teardown.
Always close() both peers in a finally.
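To make the backpressure point concrete, here is a generic illustration using a bounded `asyncio.Queue`. It is not how AudioSource is implemented, only the same idea in miniature: a producer that runs faster than the paced consumer gets suspended at `put()` instead of piling up memory. The function name, chunk size, and capacity are all illustrative.

```python
import asyncio


async def max_buffer_depth(chunks: int = 10, capacity: int = 2) -> int:
    """Return the deepest the buffer got while a fast producer fed a paced consumer."""
    buffer = asyncio.Queue(maxsize=capacity)
    deepest = 0

    async def producer() -> None:
        nonlocal deepest
        for _ in range(chunks):
            await buffer.put(b"\x00" * 640)  # suspends while the buffer is full
            deepest = max(deepest, buffer.qsize())

    async def consumer() -> None:
        for _ in range(chunks):
            await asyncio.sleep(0.002)  # stand-in for real-time pacing
            await buffer.get()

    await asyncio.gather(producer(), consumer())
    return deepest


print(asyncio.run(max_buffer_depth()))
```

The reported depth never exceeds `capacity`, no matter how many chunks the producer has ready.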
Where real TTS and STT plug in
This example moves a generated tone through the same path a production voice agent uses. To build the real thing, swap the two stand-ins:
Outbound (TTS): replace sine_pcm(...) with your text-to-speech stream and await source.push(chunk) each chunk as it's produced:
async for chunk in tts_stream():  # bytes of s16 PCM
    await source.push(chunk)
source.end()
Inbound (STT): replace the frames_in[0] += 1 counter with your speech-to-text feed inside the iter_frames loop:
@ev.peer.on(Track)
async def _(t: Track) -> None:
    async for frame in iter_frames(t.track):
        feed_stt(frame)  # convert av.AudioFrame → PCM, push to STT
A real agent runs the inbound iter_frames loop through speech-to-text → an LLM → text-to-speech, and pushes each TTS chunk back out through the AudioSource as it's produced. See the AI Agent Communication guide for the full loop.
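The shape of one conversational turn can be sketched without the SDK at all. Everything below is a stand-in invented for illustration: `FakeSource` mimics only the `push`/`end` surface described on this page, `fake_frames` takes the place of `iter_frames(t.track)`, and the STT, LLM, and TTS functions are trivial stubs.

```python
import asyncio
from collections.abc import AsyncIterator


class FakeSource:
    """Stand-in for AudioSource: records pushed chunks instead of sending them."""

    def __init__(self) -> None:
        self.pushed: list[bytes] = []
        self.ended = False

    async def push(self, pcm: bytes) -> None:
        self.pushed.append(pcm)

    def end(self) -> None:
        self.ended = True


async def fake_frames() -> AsyncIterator[bytes]:
    """Stand-in for iter_frames(t.track): yields three inbound audio chunks."""
    for _ in range(3):
        yield b"\x00" * 640


async def fake_stt(frames: AsyncIterator[bytes]) -> str:
    """Pretend transcription: reports how many chunks it heard."""
    heard = [f async for f in frames]
    return f"heard {len(heard)} chunks"


async def fake_llm(prompt: str) -> str:
    return f"reply to: {prompt}"


async def fake_tts(text: str) -> AsyncIterator[bytes]:
    """Pretend synthesis: one chunk of silence per word."""
    for _word in text.split():
        yield b"\x00" * 640


async def run_turn(frames: AsyncIterator[bytes], source: FakeSource) -> str:
    transcript = await fake_stt(frames)   # inbound audio -> text
    reply = await fake_llm(transcript)    # text -> reply text
    async for chunk in fake_tts(reply):   # reply text -> PCM chunks
        await source.push(chunk)          # push each chunk as it is produced
    source.end()  # single-turn sketch; a longer session would end once, at shutdown
    return reply


source = FakeSource()
reply = asyncio.run(run_turn(fake_frames(), source))
print(reply, len(source.pushed))
```

Swapping the stubs for real components keeps the same control flow: read frames, transcribe, generate, synthesize, push.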
See also
- Media reference: AudioSource.push/end, iter_frames, and the file/RTSP/device source helpers
- MeteredPeer reference: add_track/add_stream send your tracks; the Track event delivers a peer's media
- AI Agent Communication guide: the full speech-to-text → LLM → text-to-speech loop