Example — Audio Agent

The "AI agent in the call" shape, in one process: an agent peer streams generated audio into a room via an AudioSource, and a listener peer consumes that track frame-by-frame via iter_frames. Real WebRTC, two MeteredPeers, no browser.

PyPI: metered-realtime

This is the runnable examples/audio_agent.py from the SDK, walked through end to end.

What it demonstrates

Pushing synthesized audio into a room with AudioSource — where a voice agent's text-to-speech output goes
Consuming a peer's audio track with iter_frames — where your speech-to-text reads
The 16 kHz mono → 48 kHz stereo resample and real-time pacing that AudioSource.push does for you
An async Track handler, which the SDK schedules and tracks so it can run a whole consume loop

Install

pip install "metered-realtime[webrtc]"

Media needs the webrtc extra — it pulls in the WebRTC backend (aiortc + av). There's no getUserMedia server-side, so you supply the track.

Running it locally

The example reads your publishable key from METERED_KEY:

METERED_KEY=pk_live_… python examples/audio_agent.py

To point at a local or self-hosted server, set the optional METERED_URL:

METERED_KEY=pk_live_… METERED_URL=ws://localhost:9292 python examples/audio_agent.py

It prints "OK — listener received N audio frames from the agent" and exits 0 once media is flowing over WebRTC.

Source walkthrough

Setup — two peers, one process

Both peers read the same options. METERED_KEY is required; METERED_URL is an optional server override.

import asyncio
import math
import os
import struct
import sys

from metered_realtime import AudioSource, MediaStream, MeteredPeer, PeerJoined, Track, iter_frames

CHANNEL = "metered-realtime-audio-agent"


def _opts() -> dict[str, str]:
    key = os.environ.get("METERED_KEY")
    if not key:
        print("Set METERED_KEY=pk_live_... (a publishable key).", file=sys.stderr)
        raise SystemExit(2)
    opts = {"api_key": key}
    url = os.environ.get("METERED_URL")
    if url:
        opts["url"] = url
    return opts

A stand-in for text-to-speech

The example generates a sine tone as raw signed-16-bit mono PCM, standing in for whatever audio your real TTS produces:

def sine_pcm(freq: int = 440, seconds: float = 2.0, rate: int = 16_000) -> bytes:
    """A simple s16-mono tone — stand-in for whatever your TTS produces."""
    return b"".join(
        struct.pack("<h", int(12_000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(int(rate * seconds))
    )

In a real agent this is your speech synthesizer emitting PCM chunks.
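
To mimic a synthesizer that emits audio incrementally rather than as one buffer, the tone can be sliced into fixed-duration chunks. This is a minimal sketch assuming s16 mono PCM; `pcm_chunks` is a hypothetical helper and not part of the SDK:

```python
from collections.abc import Iterator

BYTES_PER_SAMPLE = 2  # signed 16-bit, mono


def pcm_chunks(pcm: bytes, rate: int = 16_000, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield successive chunk_ms-long slices of an s16 mono PCM buffer."""
    step = rate * chunk_ms // 1000 * BYTES_PER_SAMPLE  # bytes per chunk
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


# 2 s of silence at 16 kHz: 32,000 samples, 64,000 bytes.
two_seconds = bytes(2 * 16_000 * BYTES_PER_SAMPLE)
chunks = list(pcm_chunks(two_seconds))
print(len(chunks), len(chunks[0]))  # 100 640
```

Each 20 ms chunk at 16 kHz mono is 320 samples, or 640 bytes, so two seconds of audio comes out as 100 chunks.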

The agent — stream synthesized audio into the room

Construct an AudioSource for your input format (here 16 kHz mono PCM), then attach it as a track grouped under a named MediaStream. Every peer in the room — present or future — receives it:

source = AudioSource(input_rate=16_000)  # 16 kHz mono PCM in
agent.add_track(source, MediaStream(id="agent-voice"))

AudioSource is itself a track, so add_track sends it like any other. The stream id "agent-voice" is the stable name receivers see — it's how a listener correlates "the agent's voice" across the session and across reconnects.
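
The source here takes 16 kHz mono input, and this walkthrough describes push converting it to 48 kHz stereo. The sizes involved are easy to work out by hand. The sketch below uses a deliberately naive upsampler (sample repetition, no interpolation or filtering) purely to show the shape of that conversion; it is not how the SDK or its WebRTC backend resamples, and `naive_upsample_to_stereo` is a hypothetical name:

```python
import struct


def naive_upsample_to_stereo(pcm_mono: bytes, factor: int = 3) -> bytes:
    """Repeat each s16 mono sample `factor` times and duplicate it to L and R.

    Illustration only: a real resampler interpolates and low-pass filters.
    """
    samples = struct.unpack(f"<{len(pcm_mono) // 2}h", pcm_mono)
    out = []
    for s in samples:
        out.extend([s, s] * factor)  # interleaved L, R, repeated `factor` times
    return struct.pack(f"<{len(out)}h", *out)


frame_in = struct.pack("<320h", *range(320))  # 20 ms at 16 kHz mono
frame_out = naive_upsample_to_stereo(frame_in)
print(len(frame_in), len(frame_out))  # 640 3840
```

A 20 ms slice grows from 320 samples (640 bytes) to 960 samples per channel across two channels (3,840 bytes), a sixfold increase.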

The listener — consume the agent's track (your STT feed)

When the agent's media arrives, the SDK fires a Track event on that remote peer. The handler iterates the track with iter_frames, which yields one av.AudioFrame per 20 ms and exits cleanly when the track stops:

frames_in = [0]

@listener.on(PeerJoined)
def _(ev: PeerJoined) -> None:
    # An async Track handler is scheduled + tracked by the SDK, so we can consume
    # the whole stream here (this loop is where speech-to-text would read).
    @ev.peer.on(Track)
    async def _(t: Track) -> None:
        async for _frame in iter_frames(t.track):  # av.AudioFrame per 20 ms
            frames_in[0] += 1

The handler is async and runs a loop for the lifetime of the track. The SDK schedules and tracks it, so you don't spawn the task yourself. This loop is exactly where your speech-to-text reads: convert each av.AudioFrame to PCM and feed it to your STT.

Driving it — join order matters

try:
    # Listener joins first; the agent (joining second) is the impolite peer and so
    # drives the offer carrying its media immediately — no polite-defer wait.
    await listener.join(CHANNEL)
    await agent.join(CHANNEL)
    # Stream ~2 s of speech. push() resamples to 48 kHz and paces it out in real
    # time; for a live agent you'd push each TTS chunk as it's produced.
    await source.push(sine_pcm(seconds=2.0))
    source.end()
    await _wait(lambda: frames_in[0] > 0, 30)  # wait for media to flow over WebRTC
    await asyncio.sleep(1.0)  # let a bit more audio arrive
    print(f"OK — listener received {frames_in[0]} audio frames from the agent")
    return 0
except TimeoutError:
    print("FAILED — no audio frames received within 30s", file=sys.stderr)
    return 1
finally:
    await agent.close()
    await listener.close()

A few things are load-bearing here:

Join order. The listener joins first; the agent joins second. The second peer to join is the impolite peer, so it drives the offer carrying its media right away — there's no waiting on a polite-side renegotiation defer.
await source.push(...) queues audio for emission. It resamples your 16 kHz mono input to 48 kHz stereo internally and paces it out in real time, 20 ms at a time. It's also backpressure-aware: a faster-than-real-time producer suspends once the buffer fills rather than growing memory. For a live agent you push each TTS chunk as it's produced — not one big buffer.
source.end() signals end-of-stream: the track emits any remaining buffered audio, then stops. Call it when your producer is done so the track terminates cleanly instead of emitting silence forever.
await asyncio.sleep(1.0) lets the tail of the audio finish arriving over the network before teardown.
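
The backpressure behaviour described for push can be imitated with a bounded asyncio.Queue. This is an analogy under stated assumptions, not the SDK's internals: the queue size, the 10 ms delay, and the `producer`/`consumer` names are all made up for illustration.

```python
import asyncio


async def producer(queue: asyncio.Queue, chunks: int, log: list[str]) -> None:
    for i in range(chunks):
        await queue.put(i)  # suspends here while the queue is full
        log.append(f"put {i}")
    await queue.put(None)  # sentinel: end of stream


async def consumer(queue: asyncio.Queue, log: list[str]) -> None:
    while (item := await queue.get()) is not None:
        await asyncio.sleep(0.01)  # stand-in for pacing one chunk out in real time
        log.append(f"sent {item}")


async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded buffer
    log: list[str] = []
    await asyncio.gather(producer(queue, 6, log), consumer(queue, log))
    return log


log = asyncio.run(main())
print(log)
```

The producer can never run more than the buffer size plus one in-flight chunk ahead of the consumer, so memory stays bounded however fast chunks are generated.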

Always close() both peers in a finally.

Where real TTS and STT plug in

This example moves a generated tone through the same path a production voice agent uses. To build the real thing, swap the two stand-ins:

Outbound (TTS): replace sine_pcm(...) with your text-to-speech stream and call await source.push(chunk) for each chunk as it's produced:
```
async for chunk in tts_stream():   # bytes of s16 PCM
    await source.push(chunk)
source.end()
```

Inbound (STT): replace the frames_in[0] += 1 counter with your speech-to-text feed inside the iter_frames loop:

@ev.peer.on(Track)
async def _(t: Track) -> None:
    async for frame in iter_frames(t.track):
        feed_stt(frame)            # convert av.AudioFrame → PCM, push to STT
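
What the frame-to-PCM conversion involves depends on what your STT accepts. Many engines expect mono s16 PCM, so if the received audio is interleaved stereo, a downmix is one step. This is a stdlib-only sketch operating on raw bytes, assuming packed (interleaved) s16 stereo. `downmix_to_mono` is a hypothetical helper; extracting the bytes from the av.AudioFrame and any sample-rate conversion are left out:

```python
import struct


def downmix_to_mono(pcm_stereo: bytes) -> bytes:
    """Average interleaved s16 L/R pairs into a single s16 mono channel."""
    count = len(pcm_stereo) // 2  # total 16-bit values, L and R interleaved
    values = struct.unpack(f"<{count}h", pcm_stereo)
    mono = [(left + right) // 2 for left, right in zip(values[0::2], values[1::2])]
    return struct.pack(f"<{len(mono)}h", *mono)


stereo = struct.pack("<6h", 100, 300, -200, -400, 32767, 32767)
print(struct.unpack("<3h", downmix_to_mono(stereo)))  # (200, -300, 32767)
```

Averaging the two channels keeps the result within the s16 range, so no clipping step is needed.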

A real agent runs the inbound iter_frames loop through speech-to-text → an LLM → text-to-speech, and pushes each TTS chunk back out through the AudioSource as it's produced. See the AI Agent Communication guide for the full loop.
