Reconnect Best Practices

Required reading before production. Server-side and edge peers run for hours or days: cloud instances get migrated, NATs rebind, an AI agent's host fails over, your provider drops a TCP connection during maintenance, the signalling server rolls out a deploy. A long-lived MeteredPeer that hasn't thought about reconnect looks broken; one that has just keeps running.

This guide covers: what failure modes happen in production, what the SDK handles automatically, what your code must own, the patterns that matter for a headless process, the gotchas that bite every developer once, and how to test the whole thing.

TL;DR — the things you need to do

  1. Listen for StateChange and surface "reconnecting" in your app's health/status — a log line, a metric, a readiness flag — when to == "reconnecting". (pattern below)
  2. If you opened a P2P data channel via remote.create_data_channel(...), re-open it on each per-peer StateChange → "connected" event. (pattern below)
  3. If you cached remote.pc or a MediaStream, re-read / re-take them on StateChange → "connected". (what survives a reconnect)
  4. Handle FatalError — auth-level close codes and a stuck token_provider surface here. Log it, then await peer.close() and construct a fresh MeteredPeer if you want to recover. (pattern below)
  5. For daemons, set reconnect=ReconnectOptions(max_attempts=float("inf")) so a transient cloud blip can't take you permanently offline. (tuning below)

The rest of this page expands each of these.

What can go wrong in production

| Scenario | Frequency | Detected as |
| --- | --- | --- |
| TCP connection reset by the network | Common | Brief reconnecting → joined, ~1–2 s (close code 1006) |
| NAT rebind / interface change on the host | Common on edge | reconnecting → joined, ~3–8 s |
| Cloud instance live-migrated / paused | Occasional | reconnecting → joined once the host resumes |
| Server deploy / rolling restart | Daily-to-weekly | Disconnected with code 1001, then auto-reconnect |
| TURN failover (one TURN node dies) | Rare | Per-peer reconnecting (ICE-restart ladder) |
| Event loop starved (blocking call on the loop) | App bug | Inactivity watchdog fires (code 4000) after inactivity_timeout |
| JWT expires mid-session | Predictable from Connected.expires_at | 4002 close + auto-refresh via token_provider |
| Plan concurrent-cap hit (after a reconnect retry) | Rare | 4010 close, ≥30 s backoff floor |
| Account suspended for unpaid balance | Action-required | 4012 close, no retry → FatalError |
| Admin kicked the peer | Action-required | 4020 close, no retry → FatalError |
| Token leaked + revoked | Action-required | 4001 close on next reconnect → FatalError |
| token_provider keeps throwing | Action-required | FatalError after the consecutive-failure threshold |

What the SDK handles for you

The SDK has three independent reconnect layers that cooperate. Your code rarely needs to know about the layers individually: they all surface as the same state == "reconnecting", and almost every recovery happens without your intervention. They are described here for completeness:

Layer 1 — Signalling WebSocket auto-reconnect

The most common case. The WS drops; the SDK opens a new one with exponential backoff + jitter.

  • Backoff: starts at 0.5 s, doubles each attempt, caps at 30.0 s, ±20% jitter. (All of these are ReconnectOptions fields, expressed as float seconds, the asyncio idiom.)
  • Token refresh: your token_provider() is re-called on every reconnect, so a refreshed JWT (new TURN creds, new permissions, new expiry) lands automatically. See Authentication.
  • Auto-resubscribe: the channel that was subscribed before the drop is re-subscribed after the new welcome (auto_resubscribe=True, the default). You do not call join() again.
  • Close-code-aware: terminal codes skip retry; the over-capacity code (4010) forces a ≥30 s backoff floor. See the terminal close codes table below.
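The backoff numbers in the bullets above can be sketched in a few lines. This is an illustration of the documented curve, not the SDK's code: the function name backoff_delay, the floor parameter, and the choice to apply jitter as a uniform multiplicative spread after the floor are all assumptions.

```python
import random


def backoff_delay(attempt, initial_delay=0.5, multiplier=2.0, max_delay=30.0,
                  jitter_ratio=0.2, floor=0.0, rng=random.random):
    """Delay in seconds before reconnect attempt `attempt` (1-based)."""
    try:
        raw = initial_delay * multiplier ** (attempt - 1)
    except OverflowError:
        raw = max_delay  # very large attempt numbers simply sit at the cap
    base = max(min(raw, max_delay), floor)  # e.g. floor=30.0 after a 4010 close
    # Assumed jitter model: uniform over [base * (1 - ratio), base * (1 + ratio)].
    return base * (1.0 + jitter_ratio * (2.0 * rng() - 1.0))


# Jitter disabled to show the underlying curve.
schedule = [backoff_delay(n, jitter_ratio=0.0) for n in range(1, 9)]
print(schedule)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With the defaults the cap is reached on the seventh attempt, so anything beyond a few seconds of outage is retried roughly every 30 s.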

Layer 2 — ICE recovery (per peer)

If a single peer's underlying connection goes to "disconnected" or "failed" (NAT rebind, TURN node failed), the SDK runs an ICE restart for that peer behind the scenes.

  • Surfaces as that peer's state == "reconnecting" while it runs.
  • Other peers in the channel are unaffected.
  • Happens entirely inside the SDK — your code just waits.
  • If a recovery attempt's negotiation fails, the peer's NegotiationError event fires. Most NegotiationError events are transient recovery noise you can ignore — the SDK keeps trying, or the next signalling-level reconnect replaces the connection. If a peer is genuinely unrecoverable, it eventually goes to "closed" (fires PeerLeft), or a full close() + re-join replaces everything.

Layer 3 — Channel-level reconcile

When the signalling WS drops, every peer in your channel is technically orphaned (their connections are still up, but the SDK can't route SDP between you). On reconnect, the SDK performs a reconcile:

  • The RemotePeer instances in peer.remote_peers are preserved across the drop. Same object identity (is returns True) and same id.
  • Each survivor's underlying connection is silently replaced with a fresh one using the latest TURN credentials. The old connection is closed. No PeerJoined / PeerLeft fires for a survivor.
  • Peers that genuinely left during the drop fire PeerLeft. Peers that joined fire PeerJoined.
  • Local media added via add_stream(...) / add_track(...) automatically reattaches to each survivor's new connection — no renegotiation cycles in your code, and the per-track metadata reattaches with it.
  • The reconcile is driven by the first presence snapshot after the resubscribe; it's self-guarding in that one bad per-peer refresh is logged and skipped rather than aborting the whole reconcile or pinning a survivor at "reconnecting" forever.

The reconcile is the most surprising layer. The key consequence is: your RemotePeer references survive, but the raw remote.pc and MediaStream objects you held don't. Full survives-a-reconnect table on the RemotePeer reference; the re-read pattern is in pitfall 3.
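To make the reconcile rules concrete, here is a minimal sketch of the bookkeeping they imply: compare the peer ids you knew before the drop with the first presence snapshot, then refresh each survivor while tolerating individual failures. The names diff_presence, reconcile, and refresh are hypothetical; the SDK's internals are not shown in this guide and may differ.

```python
import logging

log = logging.getLogger("reconcile-sketch")


def diff_presence(known_ids, snapshot_ids):
    """Split peer ids into survivors, leavers, and newcomers."""
    known, present = set(known_ids), set(snapshot_ids)
    return {
        "survivors": sorted(known & present),  # keep the object, swap the connection
        "left": sorted(known - present),       # would fire PeerLeft
        "joined": sorted(present - known),     # would fire PeerJoined
    }


def reconcile(known_ids, snapshot_ids, refresh):
    """Refresh every survivor; one bad refresh is logged and skipped."""
    plan = diff_presence(known_ids, snapshot_ids)
    failed = []
    for peer_id in plan["survivors"]:
        try:
            refresh(peer_id)  # hypothetical per-peer connection swap
        except Exception:
            log.exception("refresh failed for %s; skipping", peer_id)
            failed.append(peer_id)
    return plan, failed
```

The try/except inside the loop is the "self-guarding" property: a failure for one peer never aborts the pass over the others.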

What your code must own

The SDK can't decide what your process does with a degraded connection. You own:

  1. Surfacing reconnect state — logs, metrics, a readiness flag for a load balancer, an alert. Whatever your ops story needs.
  2. Re-opening any P2P data channels you opened via remote.create_data_channel(...).
  3. Re-reading remote.pc / re-taking MediaStream objects if you cached them.
  4. Handling FatalError — auth rejected, account suspended, admin disconnect, or a stuck token_provider. Decide whether to exit, alert, or rebuild a fresh peer.
  5. Gating sends on state == "joined" — a send during "reconnecting" raises.
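For item 5, one way to gate sends is a small outbox that queues messages while the peer is not "joined" and flushes them in order afterwards. This is a sketch under assumptions: Outbox is a hypothetical helper, get_state stands in for reading peer.state, and send is treated as a plain synchronous callable (adapt it if your send call must be awaited).

```python
from collections import deque


class Outbox:
    """Queue messages while not "joined"; flush them in order once joined."""

    def __init__(self, get_state, send, max_pending=1000):
        self._get_state = get_state  # e.g. lambda: peer.state
        self._send = send            # assumed synchronous send callable
        self._pending = deque(maxlen=max_pending)  # oldest entries drop when full

    def send(self, message):
        self._pending.append(message)
        return self.flush()

    def flush(self):
        """Send queued messages while joined; return how many were sent."""
        sent = 0
        while self._pending and self._get_state() == "joined":
            self._send(self._pending[0])  # leave it queued if this raises
            self._pending.popleft()
            sent += 1
        return sent
```

Call flush() from your StateChange handler when the peer returns to "joined". The bounded deque is a deliberate choice: during a long outage it drops the oldest messages rather than growing without limit.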

The patterns

1. Surfacing reconnect state

There's no banner to show on a headless peer — but you almost always want the transition in your logs, your metrics, or a readiness flag. The StateChange event on MeteredPeer drives this.

import asyncio

from metered_realtime import StateChange

connected = asyncio.Event()  # e.g. a readiness flag your health check reads

@peer.on(StateChange)
def _(ev: StateChange) -> None:
    log.info("peer state %s -> %s", ev.from_, ev.to)
    if ev.to == "joined":
        connected.set()
    elif ev.to in ("reconnecting", "leaving", "closed"):
        connected.clear()
    if ev.to == "reconnecting":
        metrics.increment("realtime.reconnect")

peer.state is also readable at any time ("idle" | "joining" | "joined" | "reconnecting" | "leaving" | "closed") if you'd rather poll than subscribe.

Tip — don't alert on the first attempt. A reconnect that succeeds within the initial 0.5 s backoff is normal background noise. If you page on every "reconnecting", you'll page constantly. Alert only when the peer has been "reconnecting" for a while:

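async def watch_stuck_reconnect(peer, threshold=30.0):
    since = None  # loop time when "reconnecting" was first observed
    while peer.state != "closed":
        await asyncio.sleep(5.0)
        if peer.state != "reconnecting":
            since = None
            continue
        now = asyncio.get_running_loop().time()
        since = now if since is None else since
        if now - since >= threshold:
            # crude dwell check; for precision, timestamp the StateChange instead
            log.warning("peer has been reconnecting for %.0f s; investigate", now - since)
            alert_ops("realtime peer stuck reconnecting")
            since = None  # restart the dwell window so the alert is not repeated every poll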
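If polling is too coarse, an event-driven variant is to timestamp the transition itself. The sketch below is a hypothetical helper (ReconnectDwell is not part of the SDK): feed it the to value of every peer-level StateChange and ask it how long the peer has been reconnecting. The clock is injectable so the logic can be tested without waiting.

```python
import time


class ReconnectDwell:
    """Track how long the peer has been continuously "reconnecting"."""

    def __init__(self, threshold=30.0, clock=time.monotonic):
        self.threshold = threshold
        self._clock = clock
        self._since = None  # set when "reconnecting" starts, cleared otherwise

    def on_state(self, to):
        """Call from your StateChange handler with ev.to."""
        if to == "reconnecting":
            if self._since is None:
                self._since = self._clock()
        else:
            self._since = None

    def dwell(self):
        """Seconds spent in the current reconnect, or 0.0 if not reconnecting."""
        return 0.0 if self._since is None else self._clock() - self._since

    def is_stuck(self):
        return self._since is not None and self.dwell() >= self.threshold
```

A periodic health check can then alert on is_stuck() and export dwell() as a gauge, which keeps sub-second reconnects out of your paging path.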

2. Reopen data channels after reconnect

If — and only if — you opened a P2P data channel via remote.create_data_channel(...), the channel is tied to the old connection and won't survive a reconnect.

The pattern: don't open the channel once. Wire it to the per-peer StateChange → "connected", which fires once on initial connect AND once per reconnect cycle:

from metered_realtime import PeerJoined, StateChange, DataChannel

channels: dict[str, DataChannel] = {}  # peer id -> currently-open channel

@peer.on(PeerJoined)
def _(ev: PeerJoined) -> None:
    remote = ev.peer

    @remote.on(StateChange)
    def _(sc: StateChange) -> None:
        if sc.to != "connected":
            return
        old = channels.pop(remote.id, None)
        if old is not None:
            old.close()  # discard the stale channel
        raw = remote.create_data_channel("game-state", ordered=False)
        dc = DataChannel(raw)
        channels[remote.id] = dc

@peer.on(PeerJoined)
def _(ev: PeerJoined) -> None:
    @ev.peer.on(StateChange)
    def _(sc: StateChange) -> None:
        if sc.to == "closed":
            ch = channels.pop(ev.peer.id, None)
            if ch is not None:
                ch.close()

If you're not using P2P data channels — if all your inter-peer data goes through peer.send(...) / peer.send_to(...) — you don't need this. Those are server-routed and survive reconnect automatically.

See DataChannel for the backpressure-aware wrapper and the full P2P pattern.

3. Surviving a long outage

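The SDK retries for max_attempts (default 100) before giving up. With the default backoff curve (0.5 s doubling up to the 30 s cap), 100 attempts add up to roughly 47 minutes of backoff alone, before counting the time each connection attempt itself takes. Two cases: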

You're a daemon that must stay up. Set max_attempts=float("inf") so a multi-hour provider outage doesn't permanently park you in "closed":

from metered_realtime import MeteredPeer, ReconnectOptions

peer = MeteredPeer(
    api_key="pk_live_…",
    reconnect=ReconnectOptions(max_attempts=float("inf")),
)

You hit max_attempts and the peer is now "closed". This is the SDK telling you "I gave up." A MeteredPeer is terminal after it reaches "closed" — the same instance can't rejoin. Construct a fresh one:

import asyncio

from metered_realtime import StateChange

async def supervise(make_peer, channel):
    while True:
        peer = make_peer()
        closed = asyncio.Event()

        @peer.on(StateChange)
        def _(ev: StateChange, closed=closed) -> None:
            if ev.to == "closed":
                closed.set()

        await peer.join(channel)
        await closed.wait()
        log.warning("peer closed; rebuilding from scratch")
        await asyncio.sleep(1.0)  # avoid a tight rebuild loop

Why a fresh instance? peer.close() is terminal — it tears down the WS and all per-peer state. Constructing a new MeteredPeer resets everything (backoff counter, cached welcome, subscribe set, peer map) and starts clean.

4. token_provider failures

If your token_provider() keeps raising, the underlying client fires TokenProviderError after 3 consecutive failures. On MeteredPeer this is consolidated onto FatalError — the peer layer routes the stuck-auth-pipeline condition there so a single handler catches it.

If you're on SignallingClient directly, you get the lower-level event, which is informational — the SDK keeps retrying:

from metered_realtime import SignallingClient, TokenProviderError

@client.on(TokenProviderError)
def _(ev: TokenProviderError) -> None:
    log.warning("token_provider failed %dx: %r", ev.consecutive_failures, ev.err)
    if ev.consecutive_failures >= 5:
        # surface to ops / the user; don't close the client yourself
        alert_ops("realtime auth pipeline failing")

Don't disconnect the client yourself from this handler — that throws away the existing connection, which may still be working with the old token until it expires. Let the SDK keep trying. See Authentication → when token_provider keeps failing.

5. FatalError — terminal conditions

The SDK won't retry the terminal close codes, and a stuck token_provider won't recover on its own. On MeteredPeer, all of these surface through one event — FatalError — so you wire a single handler:

from metered_realtime import FatalError

@peer.on(FatalError)
def _(ev: FatalError) -> None:
    log.error("fatal realtime condition: %r", ev.err)
    report_to_sentry(ev.err)
    # Decide your recovery: exit the worker, alert ops, or rebuild a fresh peer.
    # The instance is terminal regardless: it has torn down or is tearing down.

ev.err is a MeteredRealtimeError whose message string identifies the cause (invalid_token, token_expired, channel_not_authorized, account_suspended, admin_disconnect, or a token-provider-stuck message).

No err.name to switch on. Unlike the JavaScript SDK, the Python FatalError carries the symbolic cause inside the message string, not as a separate attribute. Treat FatalError as terminal and branch on what you can act on (re-login vs billing vs kicked) by inspecting the cause — don't try to switch on a name field; there isn't one. If you need to distinguish programmatically at a finer grain, listen on the underlying SignallingClient's Disconnected event and branch on ev.code against WsCloseCode.
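Since the cause lives in the message string, branching on it comes down to a substring match. The sketch below assumes the symbolic cause names listed above appear verbatim in str(ev.err); the exact message format is not specified here, so treat the matching as a best-effort heuristic. The action labels and the classify_fatal name are hypothetical.

```python
# Cause names are the ones this guide lists; action labels are made up for the example.
FATAL_ACTIONS = {
    "invalid_token": "reauthenticate",
    "token_expired": "reauthenticate",
    "channel_not_authorized": "fix_token_claims",
    "account_suspended": "billing",
    "admin_disconnect": "stop_worker",
}


def classify_fatal(err):
    """Map a fatal error to a coarse action by inspecting its message."""
    message = str(err).lower()
    for cause, action in FATAL_ACTIONS.items():
        if cause in message:
            return action
    # Anything else, including a stuck token_provider, falls through here.
    return "rebuild_or_alert"
```

In a FatalError handler you would call classify_fatal(ev.err) and dispatch on the result. If the match ever proves too loose for your needs, the Disconnected close code is the more precise signal.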

SignallingClient customers wire on the lower-level Disconnected event instead — there's no FatalError consolidation at that layer:

from metered_realtime import WsCloseCode, Disconnected

@client.on(Disconnected)
def _(ev: Disconnected) -> None:
    if ev.will_reconnect:
        return  # SDK is handling it

    if ev.code == WsCloseCode.INVALID_TOKEN:
        prompt_login("Your session is invalid. Re-authenticate.")
    elif ev.code == WsCloseCode.CHANNEL_NOT_AUTHORIZED:
        log.error("not authorized for that channel; fix the JWT's channels claim")
    elif ev.code == WsCloseCode.ACCOUNT_SUSPENDED:
        surface_billing_problem()
    elif ev.code == WsCloseCode.ADMIN_DISCONNECT:
        log.error("disconnected by an administrator")
    else:
        # max_attempts exhausted
        log.error("gave up reconnecting (code %s)", ev.code)

Terminal close codes

The SDK does not retry these — it transitions straight to "closed" and (on MeteredPeer) emits a FatalError.

| Code | WsCloseCode member | Why it happened | What to do |
| --- | --- | --- | --- |
| 4001 | INVALID_TOKEN | JWT signature / kid / format wrong | Re-mint with the correct key |
| 4003 | CHANNEL_NOT_AUTHORIZED | Channel not in the JWT's channels claim | Mint a JWT that includes the channel |
| 4012 | ACCOUNT_SUSPENDED | Account-level kill switch (unpaid, manual) | Direct the user/operator to billing |
| 4020 | ADMIN_DISCONNECT | Disconnected via the REST API | The peer was kicked: surface it / stop the worker |

The terminal set is exactly {4001, 4003, 4012, 4020}. Two related codes are not terminal: 4002 (TOKEN_EXPIRED) reconnects after re-invoking token_provider() for a fresh JWT, and 4010 (OVER_CONCURRENT_LIMIT) reconnects with a ≥30 s backoff floor. See Errors & Codes for the full table.
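The retry rules in this section fit in a small lookup, which can be handy in tests or dashboards that interpret close codes. This is a local mirror of the numbers stated above, written as plain integers rather than the SDK's WsCloseCode enum; reconnect_policy is a hypothetical helper.

```python
TERMINAL_CODES = frozenset({4001, 4003, 4012, 4020})
OVER_CONCURRENT_LIMIT = 4010


def reconnect_policy(code):
    """Return (will_retry, minimum_delay_seconds) for a WebSocket close code."""
    if code in TERMINAL_CODES:
        return (False, None)  # straight to "closed"; no retry
    if code == OVER_CONCURRENT_LIMIT:
        return (True, 30.0)   # retried, but never sooner than 30 s
    return (True, 0.0)        # normal backoff curve, including 4002 and 1006


for code in (1006, 4002, 4010, 4020):
    print(code, reconnect_policy(code))
```

Note that this only captures the close-code decision; exhausting max_attempts also ends in "closed" regardless of the code.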

Per-scenario playbook

What actually happens, step by step, in the most common failure modes:

TCP connection reset, reconnects within seconds

  1. The OS-level connection dies. The WS surfaces a close (code 1006).
  2. SDK fires Disconnected(code=1006, will_reconnect=True) (on SignallingClient).
  3. MeteredPeer transitions to state == "reconnecting"; each survivor RemotePeer goes to "reconnecting" too (without losing identity).
  4. After ~0.5 s backoff, the SDK opens a new WS.
  5. The new WS receives the welcome; Connected(is_reconnect=True) fires.
  6. Auto-resubscribe re-issues the subscribe for the channel.
  7. Channel reconcile: the post-reconnect presence snapshot identifies survivors / leavers / newcomers. Survivors' connections are silently swapped; their metadata refreshes.
  8. MeteredPeer transitions back to "joined"; each survivor goes back to "connected".

Your code sees: StateChange → "reconnecting" then StateChange → "joined" on the peer, and per-RemotePeer StateChange → "reconnecting" then "connected".

NAT rebind / interface change

Same as a TCP reset, but the per-peer ICE state often goes "disconnected" before the WS itself notices. The per-peer ICE recovery (Layer 2) kicks in:

  1. A peer's underlying connection goes "disconnected"/"failed".
  2. That RemotePeer fires StateChange → "reconnecting".
  3. The SDK restarts ICE for that peer behind the scenes.
  4. Either ICE recovers (the peer goes back to "connected") or — if the WS itself drops — Layer 3 reconcile replaces the connection instead.

If the WS stays up, only that peer's state flickers; other peers are untouched. The intermediate "connecting" hop of a rebuilt connection is filtered, so you observe connected → reconnecting → connected.

Event loop starvation (an app bug worth knowing about)

The signalling client runs an inactivity watchdog (default inactivity_timeout=60.0 s). If your code blocks the asyncio event loop — a synchronous network call, a CPU-bound loop, a blocking library call not run in an executor — the SDK's reader can't service inbound frames, and the watchdog eventually closes the stuck socket (code 4000, CLIENT_INACTIVITY) and reconnects.

  1. The event loop is blocked; inbound frames (including server pings) aren't processed.
  2. The watchdog fires after inactivity_timeout, closes the socket with 4000, and reconnects.
  3. Normal reconnect flow.

The real fix is to not block the loop — wrap blocking work in loop.run_in_executor(...) or asyncio.to_thread(...). The watchdog is a self-healing backstop, not a substitute for non-blocking code. (4000 is the SDK's own code; the server never emits it.)
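A self-contained demonstration of the fix, using only the standard library (asyncio.to_thread needs Python 3.9 or newer). The heartbeat task stands in for anything that must keep running on the loop, such as the SDK's reader; blocking_work is a placeholder for your own blocking call.

```python
import asyncio
import time


def blocking_work(seconds):
    """Stand-in for a synchronous call that would otherwise stall the loop."""
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks, interval=0.05):
    """Records a timestamp on every loop wake-up."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    # Runs in a worker thread, so the loop keeps servicing other tasks.
    result = await asyncio.to_thread(blocking_work, 0.3)
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return result, len(ticks)


result, tick_count = asyncio.run(main())
print(result, tick_count)
```

If you call blocking_work(0.3) directly inside main() instead, the heartbeat records nothing for the whole 0.3 s, which is exactly the stall that lets the inactivity watchdog fire.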

Server deploy (rolling restart)

  1. The server emits a going-away frame carrying a retry_after_ms hint.
  2. SDK fires the GoingAway event with retry_after_ms — informational.
  3. The server closes with code 1001.
  4. SDK fires Disconnected(code=1001, will_reconnect=True).
  5. SDK reconnects on its normal exponential backoff.
  6. A new server instance accepts the connection. Welcome arrives.

Note: the SDK does not automatically delay by retry_after_ms — it uses its normal backoff curve. The hint exists so a fleet of peers can opt in to spreading reconnects across a deploy window. A long-running fleet of server-side peers that must coordinate with a deploy may want to honor it by listening for GoingAway and gating its own delay; for most apps the normal jittered backoff already spreads reconnects adequately.

JWT expires mid-session

  1. The server sends close code 4002 (TOKEN_EXPIRED).
  2. SDK fires Disconnected(code=4002, will_reconnect=True).
  3. SDK calls your token_provider() for a fresh JWT.
  4. Reconnect.

For this to work, token_provider() must return a fresh JWT every call — don't cache a token until it expires. Either mint a new one on every call, or cache with a tighter TTL than the JWT's exp (e.g. mint with 1 h expiry, cache for 50 min). You can predict the expiry from Connected.expires_at.
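The "cache with a tighter TTL" option might look like the sketch below. Everything here is illustrative: CachedTokenProvider and mint are hypothetical names, mint stands for whatever call produces a JWT from your backend, and the provider is written as a plain synchronous callable (wrap it if your setup needs a coroutine).

```python
import time


class CachedTokenProvider:
    """Return a cached token until `ttl` seconds pass, then mint a new one."""

    def __init__(self, mint, ttl=50 * 60, clock=time.monotonic):
        self._mint = mint    # hypothetical: returns a JWT valid for about 1 h
        self._ttl = ttl      # keep this shorter than the JWT's own lifetime
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def __call__(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._mint()
            self._expires_at = now + self._ttl
        return self._token

    def invalidate(self):
        """Force the next call to mint, e.g. after an auth-related close."""
        self._token = None
```

Because the cache lifetime is shorter than the token lifetime, a reconnect triggered by a 4002 close always lands after the cache entry has lapsed, which avoids handing the server the same expired token again.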

Concurrent-connection cap hit

  1. SDK tries to reconnect; the server rejects with 4010.
  2. SDK enforces a ≥30 s backoff floor on the next attempt.
  3. Either the cap clears (another connection freed up) or the SDK gives up after max_attempts.

This usually means a connection leak somewhere — old MeteredPeer / SignallingClient instances you forgot to close(). Always await peer.close() (or use async with) so the TCP FIN flushes and the server records a clean disconnect promptly. The dashboard's "Current concurrent connections" graph shows you the number.

Tuning ReconnectOptions

from metered_realtime import ReconnectOptions

ReconnectOptions(
    initial_delay=0.5,  # first backoff, in seconds
    multiplier=2.0,     # doubles each attempt
    max_delay=30.0,     # backoff cap, in seconds
    jitter_ratio=0.2,   # ±20% spread, so a fleet doesn't reconnect in lockstep
    max_attempts=100,   # float("inf") for daemons
)

Backoff is min(initial_delay * multiplier ** (attempt - 1), max_delay) ± jitter_ratio; each successful welcome resets the attempt counter. The constructor validates these — initial_delay <= 0, multiplier < 1, max_delay < initial_delay, jitter_ratio outside [0, 1], or a negative max_attempts each raise ValueError, so a zero-delay reconnect loop is impossible to configure by accident. max_attempts=float("inf") is allowed.

Pass it as the reconnect option (or reconnect=False to disable reconnect entirely — rarely what you want):

peer = MeteredPeer(
    api_key="pk_live_…",
    reconnect=ReconnectOptions(max_attempts=float("inf"), max_delay=15.0),
)

Leave auto_resubscribe=True on MeteredPeer. Its channel recovery depends on the post-reconnect re-subscribe; the opt-out exists on SignallingClient for manual subscription control. Turning it off on a SignallingClient and forgetting to re-subscribe means you silently miss every message after the first reconnect.
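When choosing max_attempts and max_delay, it helps to know how much outage a given configuration rides out before the peer gives up. The sketch below sums the documented backoff curve; retry_budget is a hypothetical helper, and it ignores jitter and the time each connection attempt itself takes, so treat the result as a lower bound.

```python
def retry_budget(max_attempts, initial_delay=0.5, multiplier=2.0, max_delay=30.0):
    """Total backoff in seconds across `max_attempts` attempts, without jitter."""
    total = 0.0
    delay = initial_delay
    for _ in range(max_attempts):
        total += min(delay, max_delay)
        delay = min(delay * multiplier, max_delay)  # clamp so it never overflows
    return total


default_budget = retry_budget(100)                  # 2851.5 s, about 47.5 minutes
tighter_budget = retry_budget(100, max_delay=15.0)  # 1440.5 s, about 24 minutes
print(default_budget, tighter_budget)
```

Lowering max_delay makes recovery snappier once the network returns but roughly halves the outage a finite max_attempts can survive, which is one more reason daemons pair it with max_attempts=float("inf").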

Testing your reconnect logic

You can't validate reconnect behaviour by just running the happy path. The failure modes worth testing before production:

| Test | How to simulate | What to verify |
| --- | --- | --- |
| Brief network drop | Block the host's egress to rms.metered.ca for ~5 s (firewall rule, iptables, pull the interface) | StateChange → "reconnecting" then "joined"; buffered app sends flush after recovery |
| Long outage | Block egress for 60 s | Stays in "reconnecting"; recovers cleanly when egress restored (with max_attempts high enough) |
| Server-initiated kick | DELETE the peer via the REST API from another process | Close code 4020 surfaces; no retry; FatalError fires |
| JWT expiry | Mint a JWT with exp ~60 s out and wait | Code 4002 disconnect, token_provider() called, reconnect succeeds with the fresh token |
| Token revocation | Revoke the key in the dashboard while connected | Reconnect attempt fails with 4001; FatalError fires |
| Data channel reopen | If using P2P channels: force a reconnect, verify channel traffic resumes | Per-peer StateChange → "connected" fires, opens a fresh channel, data resumes |
| Concurrent cap | Open more connections than your plan allows | Code 4010, ≥30 s backoff floor, eventually gives up or one frees |
| Event-loop starvation | time.sleep(90) on the loop (don't do this in prod) | Watchdog fires (code 4000) after inactivity_timeout; reconnects |

Common pitfalls

  1. Alerting on every "reconnecting". A reconnect that succeeds within the 0.5 s initial backoff is normal. Alert on dwell time, not on the transition (see pattern 1).

  2. A P2P data channel silently stops carrying data after a reconnect. Caused by holding the channel across the drop. Use the per-peer StateChange → "connected" pattern from pattern 2.

  3. Stale remote.pc references. Same root cause as #2: remote.pc is a different object after a reconcile. Either re-read it every time you need it, or wire to StateChange → "connected". (See RemotePeer → the pc escape hatch.)

  4. Caching a MediaStream from StreamAdded. The object identity changes on reconnect (the stream.id is stable, but it's a fresh object). Re-take it on each StreamAdded, keyed by stream.id. StreamRemoved is suppressed during a reconnect, so it won't look like the peer left.

  5. token_provider() returning a stale (cached) JWT. If your mint caches and you refresh by reconnecting, you'll cycle: 4002 close → token_provider → stale token → 4002 again. Make sure mint returns fresh (see the JWT-expiry playbook).

  6. Calling peer.send(...) during "reconnecting". Raises MeteredPeerSendError with code == "not_joined". Either gate sends on peer.state == "joined", or queue them yourself and flush on StateChange → "joined".

  7. Trying to rejoin with the same MeteredPeer after close() (or after it reached "closed"). Terminal. Construct a fresh instance — see pattern 3.

  8. Not handling state == "closed". On a MeteredPeer that's reached max_attempts, this is the SDK saying "I gave up." If your code just awaits the next StateChange forever, you're stuck. Supervise and rebuild (pattern 3), or run with max_attempts=float("inf").

  9. Blocking the asyncio event loop. A synchronous call on the loop stalls the SDK's reader; the inactivity watchdog will close-and-reconnect (code 4000), but you've also stalled everything else. Run blocking work in an executor / thread.

  10. auto_resubscribe=False on a SignallingClient without re-subscribing. Default is True — leave it on unless you have a specific scoped-subscription pattern. Off + no manual re-subscribe means you silently miss messages on every subsequent connection. (MeteredPeer's channel recovery requires it on.)

See also

  • MeteredPeer: StateChange, the state machine, FatalError, what survives a reconnect
  • RemotePeer: what survives reconcile, what doesn't, the pc escape hatch
  • SignallingClient: Disconnected, GoingAway, TokenProviderError, ReconnectOptions
  • Errors & Codes: every close code + recommended response
  • DataChannel: the backpressure-aware P2P wrapper and reconnect-aware reopen
  • Authentication: token_provider minting + refresh on reconnect