Example — Audio Agent

The "AI agent in the call" shape, in one process: an agent peer streams generated audio into a room via an AudioSource, and a listener peer consumes that track frame-by-frame via iter_frames. Real WebRTC, two MeteredPeers, no browser.

PyPI: metered-realtime

This is the runnable examples/audio_agent.py from the SDK, walked through end to end.

What it demonstrates

  • Pushing synthesized audio into a room with AudioSource — where a voice agent's text-to-speech output goes
  • Consuming a peer's audio track with iter_frames — where your speech-to-text reads
  • The 16 kHz mono → 48 kHz stereo resample and real-time pacing AudioSource.push does for you
  • An async Track handler, which the SDK schedules and tracks so it can run a whole consume loop

Install

pip install "metered-realtime[webrtc]"

Media needs the webrtc extra — it pulls in the WebRTC backend (aiortc + av). There's no getUserMedia server-side, so you supply the track.

Running it locally

The example reads your publishable key from METERED_KEY:

METERED_KEY=pk_live_… python examples/audio_agent.py

To point at a local or self-hosted server, set the optional METERED_URL:

METERED_KEY=pk_live_… METERED_URL=ws://localhost:9292 python examples/audio_agent.py

It prints OK — listener received N audio frames from the agent and exits 0 once media is flowing over WebRTC.

Source walkthrough

Setup — two peers, one process

Both peers read the same options. METERED_KEY is required; METERED_URL is an optional server override.

import asyncio
import math
import struct
import os
import sys

from metered_realtime import AudioSource, MediaStream, MeteredPeer, PeerJoined, Track, iter_frames

CHANNEL = "metered-realtime-audio-agent"


def _opts() -> dict[str, str]:
    key = os.environ.get("METERED_KEY")
    if not key:
        print("Set METERED_KEY=pk_live_... (a publishable key).", file=sys.stderr)
        raise SystemExit(2)
    opts = {"api_key": key}
    url = os.environ.get("METERED_URL")
    if url:
        opts["url"] = url
    return opts

A stand-in for text-to-speech

The example generates a sine tone as raw signed-16-bit mono PCM, standing in for whatever audio your real TTS produces:

def sine_pcm(freq: int = 440, seconds: float = 2.0, rate: int = 16_000) -> bytes:
    """A simple s16-mono tone — stand-in for whatever your TTS produces."""
    return b"".join(
        struct.pack("<h", int(12_000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(int(rate * seconds))
    )

In a real agent this is your speech synthesizer emitting PCM chunks.
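A real synthesizer emits audio incrementally rather than as one buffer. The sketch below slices the generated tone into 20 ms chunks to imitate that. The chunk_pcm helper and the 20 ms chunk size are assumptions made for illustration; they are not part of the SDK.

```python
import math
import struct

RATE = 16_000         # input sample rate in Hz, as in the example
BYTES_PER_SAMPLE = 2  # signed 16-bit mono
CHUNK_MS = 20         # assumed chunk duration, chosen for illustration


def sine_pcm(freq: int = 440, seconds: float = 2.0, rate: int = RATE) -> bytes:
    """Same tone generator as the example: s16 little-endian mono PCM."""
    return b"".join(
        struct.pack("<h", int(12_000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(int(rate * seconds))
    )


def chunk_pcm(pcm: bytes, rate: int = RATE, chunk_ms: int = CHUNK_MS):
    """Yield successive chunk_ms slices of s16 mono PCM (hypothetical helper)."""
    step = rate * chunk_ms // 1000 * BYTES_PER_SAMPLE  # bytes per chunk
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


pcm = sine_pcm(seconds=2.0)
chunks = list(chunk_pcm(pcm))
print(len(pcm), len(chunks), len(chunks[0]))
```

Two seconds at 16 kHz is 32,000 samples, or 64,000 bytes, which splits into 100 chunks of 640 bytes each.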

The agent — stream synthesized audio into the room

Construct an AudioSource for your input format (here 16 kHz mono PCM), then attach it as a track grouped under a named MediaStream. Every peer in the room — present or future — receives it:

source = AudioSource(input_rate=16_000)  # 16 kHz mono PCM in
agent.add_track(source, MediaStream(id="agent-voice"))

AudioSource is itself a track, so add_track sends it like any other. The stream id "agent-voice" is the stable name receivers see — it's how a listener correlates "the agent's voice" across the session and across reconnects.

The listener — consume the agent's track (your STT feed)

When the agent's media arrives, the SDK fires a Track event on that remote peer. The handler iterates the track with iter_frames, which yields one av.AudioFrame per 20 ms and exits cleanly when the track stops:

frames_in = [0]

@listener.on(PeerJoined)
def _(ev: PeerJoined) -> None:
    # An async Track handler is scheduled + tracked by the SDK, so we can consume
    # the whole stream here (this loop is where speech-to-text would read).
    @ev.peer.on(Track)
    async def _(t: Track) -> None:
        async for _frame in iter_frames(t.track):  # av.AudioFrame per 20 ms
            frames_in[0] += 1

The handler is async and runs a loop for the lifetime of the track. The SDK schedules and tracks it, so you don't spawn the task yourself. This loop is exactly where your speech-to-text reads: convert each av.AudioFrame to PCM and feed it to your STT.
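Many STT engines expect 16 kHz mono PCM. As a rough sketch of the conversion step, the function below takes interleaved s16 stereo bytes at 48 kHz and produces 16 kHz mono. It assumes the received frames are 48 kHz stereo s16, which you should verify on the frames you actually get, and it assumes you have already extracted raw bytes from the frame (PyAV's AudioFrame.to_ndarray() is one way). The stereo48k_to_mono16k name is hypothetical, and the decimation is deliberately naive: it applies no low-pass filter, so a production feed should use a proper resampler.

```python
import array
import struct
import sys


def stereo48k_to_mono16k(pcm: bytes) -> bytes:
    """Downmix interleaved s16 stereo to mono, then keep every third sample.

    Naive decimation without filtering: acceptable for a sketch only.
    """
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian
    left, right = samples[0::2], samples[1::2]
    mono = array.array("h", ((l + r) // 2 for l, r in zip(left, right)))
    out = mono[::3]  # 48 kHz -> 16 kHz
    if sys.byteorder == "big":
        out.byteswap()  # emit little-endian again
    return out.tobytes()


# One 20 ms frame at 48 kHz stereo: 960 samples per channel, 3840 bytes.
frame_bytes = struct.pack("<1920h", *([1000, 3000] * 960))
out = stereo48k_to_mono16k(frame_bytes)
print(len(frame_bytes), len(out))
```

A 20 ms input frame stays 20 ms long on the way out: 320 mono samples, 640 bytes.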

Driving it — join order matters

try:
    # Listener joins first; the agent (joining second) is the impolite peer and so
    # drives the offer carrying its media immediately — no polite-defer wait.
    await listener.join(CHANNEL)
    await agent.join(CHANNEL)
    # Stream ~2 s of speech. push() resamples to 48 kHz and paces it out in real
    # time; for a live agent you'd push each TTS chunk as it's produced.
    await source.push(sine_pcm(seconds=2.0))
    source.end()
    await _wait(lambda: frames_in[0] > 0, 30)  # wait for media to flow over WebRTC
    await asyncio.sleep(1.0)  # let a bit more audio arrive
    print(f"OK — listener received {frames_in[0]} audio frames from the agent")
    return 0
except TimeoutError:
    print("FAILED — no audio frames received within 30s", file=sys.stderr)
    return 1
finally:
    await agent.close()
    await listener.close()

A few things are load-bearing here:

  • Join order. The listener joins first; the agent joins second. The second peer to join is the impolite peer, so it drives the offer carrying its media right away — there's no waiting on a polite-side renegotiation defer.
  • await source.push(...) queues audio for emission. It resamples your 16 kHz mono input to 48 kHz stereo internally and paces it out in real time, 20 ms at a time. It's also backpressure-aware: a faster-than-real-time producer suspends once the buffer fills rather than growing memory. For a live agent you push each TTS chunk as it's produced — not one big buffer.
  • source.end() signals end-of-stream: the track emits any remaining buffered audio, then stops. Call it when your producer is done so the track terminates cleanly instead of emitting silence forever.
  • await asyncio.sleep(1.0) lets the tail of the audio finish arriving over the network before teardown.

Always close() both peers in a finally.

Where real TTS and STT plug in

This example moves a generated tone through the same path a production voice agent uses. To build the real thing, swap the two stand-ins:

  • Outbound (TTS): replace sine_pcm(...) with your text-to-speech stream and await source.push(chunk) for each chunk as it's produced:

    async for chunk in tts_stream():   # bytes of s16 PCM
        await source.push(chunk)
    source.end()
  • Inbound (STT): replace the frames_in[0] += 1 counter with your speech-to-text feed inside the iter_frames loop:

    @ev.peer.on(Track)
    async def _(t: Track) -> None:
        async for frame in iter_frames(t.track):
            feed_stt(frame)  # convert av.AudioFrame → PCM, push to STT

A real agent runs the inbound iter_frames loop through speech-to-text → an LLM → text-to-speech, and pushes each TTS chunk back out through the AudioSource as it's produced. See the AI Agent Communication guide for the full loop.
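To make the shape of that loop concrete, here is a rough sketch with every stage stubbed out. fake_stt, fake_llm, fake_tts, and run_agent are all hypothetical names invented for this illustration. In a real agent the frames argument would be iter_frames(t.track) and push would be source.push.

```python
import asyncio


async def fake_stt(frames):
    """Hypothetical STT stub: emits one 'utterance' per 5 incoming frames."""
    count = 0
    async for _ in frames:
        count += 1
        if count % 5 == 0:
            yield f"utterance-{count // 5}"


async def fake_llm(text: str) -> str:
    """Hypothetical LLM stub: returns a canned reply."""
    return f"reply to {text}"


async def fake_tts(text: str):
    """Hypothetical TTS stub: two 20 ms chunks of 16 kHz s16 silence per reply."""
    for _ in range(2):
        yield b"\x00" * 640


async def run_agent(frames, push) -> int:
    """Inbound frames -> STT -> LLM -> TTS -> push, one chunk at a time."""
    pushed = 0
    async for text in fake_stt(frames):
        reply = await fake_llm(text)
        async for chunk in fake_tts(reply):
            await push(chunk)  # awaiting lets the sink apply backpressure
            pushed += 1
    return pushed


async def _demo():
    sent = []

    async def frames():
        for i in range(10):
            yield i  # stand-in for av.AudioFrame

    async def push(chunk: bytes) -> None:
        sent.append(chunk)  # stand-in for the audio source's push

    pushed = await run_agent(frames(), push)
    return pushed, sent


pushed, sent = asyncio.run(_demo())
print(pushed, len(sent))
```

Ten stand-in frames produce two utterances, and each reply becomes two chunks, so four chunks are pushed. Because each stage is an async generator, nothing buffers a whole utterance before the next stage starts.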

See also