Reconnect Best Practices
Required reading before production. Server-side and edge peers run for hours or days: cloud instances get migrated, NATs rebind, an AI agent's host fails over, your provider drops a TCP connection during maintenance, the signalling server rolls out a deploy. A long-lived MeteredPeer that hasn't thought about reconnect looks broken; one that has just keeps running.
This guide covers: what failure modes happen in production, what the SDK handles automatically, what your code must own, the patterns that matter for a headless process, the gotchas that bite every developer once, and how to test the whole thing.
TL;DR — the things you need to do
- Listen for StateChange and surface "reconnecting" in your app's health/status (a log line, a metric, a readiness flag) when to == "reconnecting". (pattern below)
- If you opened a P2P data channel via remote.create_data_channel(...), re-open it on each per-peer StateChange → "connected" event. (pattern below)
- If you cached remote.pc or a MediaStream, re-read / re-take them on StateChange → "connected". (what survives a reconnect)
- Handle FatalError: auth-level close codes and a stuck token_provider surface here. Log it, then await peer.close() and construct a fresh MeteredPeer if you want to recover. (pattern below)
- For daemons, set reconnect=ReconnectOptions(max_attempts=float("inf")) so a transient cloud blip can't take you permanently offline. (tuning below)
The rest of this page expands each of these.
What can go wrong in production
| Scenario | Frequency | Detected as |
|---|---|---|
| TCP connection reset by the network | Common | Brief reconnecting → joined, ~1–2 s (close code 1006) |
| NAT rebind / interface change on the host | Common on edge | reconnecting → joined, ~3–8 s |
| Cloud instance live-migrated / paused | Occasional | reconnecting → joined once the host resumes |
| Server deploy / rolling restart | Daily-to-weekly | Disconnected with code 1001, then auto-reconnect |
| TURN failover (one TURN node dies) | Rare | Per-peer reconnecting (ICE-restart ladder) |
| Event loop starved (blocking call on the loop) | App bug | Inactivity watchdog fires (code 4000) after inactivity_timeout |
| JWT expires mid-session | Predictable from Connected.expires_at | 4002 close + auto-refresh via token_provider |
| Plan concurrent-cap hit (after a reconnect retry) | Rare | 4010 close, ≥30 s backoff floor |
| Account suspended for unpaid balance | Action-required | 4012 close, no retry → FatalError |
| Admin kicked the peer | Action-required | 4020 close, no retry → FatalError |
| Token leaked + revoked | Action-required | 4001 close on next reconnect → FatalError |
| token_provider keeps throwing | Action-required | FatalError after the consecutive-failure threshold |
What the SDK handles for you
The SDK has three independent reconnect layers that cooperate. Your code rarely needs to know about the layers individually: they all surface as the same state == "reconnecting", and almost every recovery happens without your intervention. They are described here for completeness:
Layer 1 — Signalling WebSocket auto-reconnect
The most common case. The WS drops; the SDK opens a new one with exponential backoff + jitter.
- Backoff: starts at 0.5 s, doubles each attempt, caps at 30.0 s, ±20% jitter. (All ReconnectOptions durations are float seconds, the asyncio idiom.)
- Token refresh: your token_provider() is re-called on every reconnect, so a refreshed JWT (new TURN creds, new permissions, new expiry) lands automatically. See Authentication.
- Auto-resubscribe: the channel that was subscribed before the drop is re-subscribed after the new welcome (auto_resubscribe=True, the default). You do not call join() again.
- Close-code-aware: terminal codes skip retry; the over-capacity code (4010) forces a ≥30 s backoff floor. See the terminal close codes table below.
Layer 2 — ICE recovery (per peer)
If a single peer's underlying connection goes to "disconnected" or "failed" (NAT rebind, TURN node failed), the SDK runs an ICE restart for that peer behind the scenes.
- Surfaces as that peer's state == "reconnecting" while it runs.
- Other peers in the channel are unaffected.
- Happens entirely inside the SDK; your code just waits.
- If a recovery attempt's negotiation fails, the peer's NegotiationError event fires. Most NegotiationError events are transient recovery noise you can ignore: the SDK keeps trying, or the next signalling-level reconnect replaces the connection. If a peer is genuinely unrecoverable, it eventually goes to "closed" (fires PeerLeft), or a full close() + re-join replaces everything.
Layer 3 — Channel-level reconcile
When the signalling WS drops, every peer in your channel is technically orphaned (their connections are still up, but the SDK can't route SDP between you). On reconnect, the SDK performs a reconcile:
- The RemotePeer instances in peer.remote_peers are preserved across the drop. Same object identity (is returns True) and same id.
- Each survivor's underlying connection is silently replaced with a fresh one using the latest TURN credentials. The old connection is closed. No PeerJoined / PeerLeft fires for a survivor.
- Peers that genuinely left during the drop fire PeerLeft. Peers that joined fire PeerJoined.
- Local media added via add_stream(...) / add_track(...) automatically reattaches to each survivor's new connection: no renegotiation cycles in your code, and the per-track metadata reattaches with it.
- The reconcile is driven by the first presence snapshot after the resubscribe; it's self-guarding in that one bad per-peer refresh is logged and skipped rather than aborting the whole reconcile or pinning a survivor at "reconnecting" forever.
The reconcile is the most surprising layer. The key consequence is: your RemotePeer references survive, but the raw remote.pc and MediaStream objects you held don't. The full survives-a-reconnect table is on the RemotePeer reference; the re-read pattern is in pitfall 3.
What your code must own
The SDK can't decide what your process does with a degraded connection. You own:
- Surfacing reconnect state — logs, metrics, a readiness flag for a load balancer, an alert. Whatever your ops story needs.
- Re-opening any P2P data channels you opened via remote.create_data_channel(...).
- Re-reading remote.pc / re-taking MediaStream objects if you cached them.
- Handling FatalError: auth rejected, account suspended, admin disconnect, or a stuck token_provider. Decide whether to exit, alert, or rebuild a fresh peer.
- Gating sends on state == "joined": a send during "reconnecting" raises.
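The last item, gating sends, can be handled with a small outbound queue. The following is a minimal sketch rather than an SDK feature: SendGate, its method names, and the max_buffered limit are hypothetical, and it assumes the send callable you pass in is synchronous (adapt it if your send is a coroutine).

```python
from collections import deque
from typing import Any, Callable, Deque


class SendGate:
    """Buffer outbound messages while the peer is not "joined" (hypothetical helper)."""

    def __init__(self, send: Callable[[Any], None], get_state: Callable[[], str],
                 max_buffered: int = 1000) -> None:
        self._send = send            # e.g. peer.send (assumed synchronous here)
        self._get_state = get_state  # e.g. lambda: peer.state
        # A bounded deque silently drops the oldest message once it is full.
        self._pending: Deque[Any] = deque(maxlen=max_buffered)

    def send(self, msg: Any) -> bool:
        """Send now if joined, otherwise buffer. Returns True if sent immediately."""
        if self._get_state() == "joined":
            self._send(msg)
            return True
        self._pending.append(msg)
        return False

    def flush(self) -> int:
        """Call when StateChange reports to == "joined". Returns the count flushed."""
        flushed = 0
        while self._pending and self._get_state() == "joined":
            self._send(self._pending.popleft())
            flushed += 1
        return flushed
```

Calling gate.flush() from the StateChange handler when ev.to == "joined" drains whatever accumulated during the outage. Whether dropping the oldest messages is acceptable depends on your application; for state that goes stale quickly it is often preferable to replaying everything.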
The patterns
1. Surfacing reconnect state
There's no banner to show on a headless peer — but you almost always want the transition in your logs, your metrics, or a readiness flag. The StateChange event on MeteredPeer drives this.
import asyncio
from metered_realtime import StateChange
connected = asyncio.Event()  # e.g. a readiness flag your health check reads
@peer.on(StateChange)
def _(ev: StateChange) -> None:
    log.info("peer state %s -> %s", ev.from_, ev.to)
    if ev.to == "joined":
        connected.set()
    elif ev.to in ("reconnecting", "leaving", "closed"):
        connected.clear()
    if ev.to == "reconnecting":
        metrics.increment("realtime.reconnect")
peer.state is also readable at any time ("idle" | "joining" | "joined" | "reconnecting" | "leaving" | "closed") if you'd rather poll than subscribe.
Tip — don't alert on the first attempt. A reconnect that succeeds within the initial 0.5 s backoff is normal background noise. If you page on every "reconnecting", you'll page constantly. Alert only when the peer has been "reconnecting" for a while:
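import time
async def watch_stuck_reconnect(peer, threshold=30.0):
    since = None  # monotonic time when "reconnecting" was first observed
    while peer.state != "closed":
        await asyncio.sleep(5.0)
        if peer.state != "reconnecting":
            since = None
        elif since is None:
            since = time.monotonic()
        elif time.monotonic() - since >= threshold:
            # crude dwell check; for precision, timestamp the StateChange instead
            log.warning("peer has been reconnecting for %.0f s or more; investigate", threshold)
            alert_ops("realtime peer stuck reconnecting")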
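For a more precise dwell measurement than polling, the transition itself can be timestamped. This is a sketch under stated assumptions: ReconnectDwell and its methods are hypothetical names, and the only thing taken from the SDK is the set of state strings fed in from StateChange.to.

```python
import time
from typing import Callable, Optional


class ReconnectDwell:
    """Track how long the peer has been in "reconnecting" (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._since: Optional[float] = None  # when "reconnecting" began, else None

    def on_state(self, to: str) -> None:
        """Feed every StateChange.to value into this method."""
        if to == "reconnecting":
            if self._since is None:  # keep the first timestamp across repeats
                self._since = self._clock()
        else:
            self._since = None

    def dwell(self) -> float:
        """Seconds spent reconnecting so far; 0.0 when not reconnecting."""
        return 0.0 if self._since is None else self._clock() - self._since

    def is_stuck(self, threshold: float = 30.0) -> bool:
        """True only while reconnecting and the dwell has reached the threshold."""
        return self._since is not None and self.dwell() >= threshold
```

A StateChange handler would call dwell.on_state(ev.to), and a periodic task or health endpoint would read dwell.is_stuck(). The injectable clock is there so the logic can be unit-tested without sleeping.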
2. Reopen data channels after reconnect
If — and only if — you opened a P2P data channel via remote.create_data_channel(...), the channel is tied to the old connection and won't survive a reconnect.
The pattern: don't open the channel once. Wire it to the per-peer StateChange → "connected", which fires once on initial connect AND once per reconnect cycle:
from metered_realtime import PeerJoined, StateChange, DataChannel
channels: dict[str, DataChannel] = {}  # peer id -> currently-open channel
# First handler: (re)open the channel on every per-peer "connected".
@peer.on(PeerJoined)
def _(ev: PeerJoined) -> None:
    remote = ev.peer
    @remote.on(StateChange)
    def _(sc: StateChange) -> None:
        if sc.to != "connected":
            return
        old = channels.pop(remote.id, None)
        if old is not None:
            old.close()  # discard the stale channel
        raw = remote.create_data_channel("game-state", ordered=False)
        dc = DataChannel(raw)
        channels[remote.id] = dc
# Second handler: drop the channel once the peer is gone for good.
@peer.on(PeerJoined)
def _(ev: PeerJoined) -> None:
    @ev.peer.on(StateChange)
    def _(sc: StateChange) -> None:
        if sc.to == "closed":
            ch = channels.pop(ev.peer.id, None)
            if ch is not None:
                ch.close()
If you're not using P2P data channels — if all your inter-peer data goes through peer.send(...) / peer.send_to(...) — you don't need this. Those are server-routed and survive reconnect automatically.
See DataChannel for the backpressure-aware wrapper and the full P2P pattern.
3. Surviving a long outage
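The SDK retries for max_attempts (default 100) before giving up. With the default backoff capped at 30 seconds, that is roughly 47 minutes of backoff alone before jitter (about 31.5 s for the first six attempts, then 94 attempts at the cap). Two cases: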
You're a daemon that must stay up. Set max_attempts=float("inf") so a multi-hour provider outage doesn't permanently park you in "closed":
from metered_realtime import MeteredPeer, ReconnectOptions
peer = MeteredPeer(
    api_key="pk_live_…",
    reconnect=ReconnectOptions(max_attempts=float("inf")),
)
You hit max_attempts and the peer is now "closed". This is the SDK telling you "I gave up." A MeteredPeer is terminal after it reaches "closed" — the same instance can't rejoin. Construct a fresh one:
import asyncio
from metered_realtime import StateChange
async def supervise(make_peer, channel):
    while True:
        peer = make_peer()
        closed = asyncio.Event()
        @peer.on(StateChange)
        def _(ev: StateChange, closed=closed) -> None:
            if ev.to == "closed":
                closed.set()
        await peer.join(channel)
        await closed.wait()
        log.warning("peer closed; rebuilding from scratch")
        await asyncio.sleep(1.0)  # avoid a tight rebuild loop
Why a fresh instance? peer.close() is terminal — it tears down the WS and all per-peer state. Constructing a new MeteredPeer resets everything (backoff counter, cached welcome, subscribe set, peer map) and starts clean.
4. token_provider failures
If your token_provider() keeps raising, the underlying client fires TokenProviderError after 3 consecutive failures. On MeteredPeer this is consolidated onto FatalError — the peer layer routes the stuck-auth-pipeline condition there so a single handler catches it.
If you're on SignallingClient directly, you get the lower-level event, which is informational — the SDK keeps retrying:
from metered_realtime import SignallingClient, TokenProviderError
@client.on(TokenProviderError)
def _(ev: TokenProviderError) -> None:
    log.warning("token_provider failed %dx: %r", ev.consecutive_failures, ev.err)
    if ev.consecutive_failures >= 5:
        # surface to ops / the user; don't close the client yourself
        alert_ops("realtime auth pipeline failing")
Don't disconnect the client yourself from this handler — that throws away the existing connection, which may still be working with the old token until it expires. Let the SDK keep trying. See Authentication → when token_provider keeps failing.
5. FatalError — terminal conditions
The SDK won't retry the terminal close codes, and a stuck token_provider won't recover on its own. On MeteredPeer, all of these surface through one event — FatalError — so you wire a single handler:
from metered_realtime import FatalError
@peer.on(FatalError)
def _(ev: FatalError) -> None:
    log.error("fatal realtime condition: %r", ev.err)
    report_to_sentry(ev.err)
    # Decide your recovery: exit the worker, alert ops, or rebuild a fresh peer.
    # The instance is terminal regardless: it has torn down or is tearing down.
ev.err is a MeteredRealtimeError whose message string identifies the cause (invalid_token, token_expired, channel_not_authorized, account_suspended, admin_disconnect, or a token-provider-stuck message).
No err.name to switch on. Unlike the JavaScript SDK, the Python FatalError carries the symbolic cause inside the message string, not as a separate attribute. Treat FatalError as terminal and branch on what you can act on (re-login vs billing vs kicked) by inspecting the cause; don't try to switch on a name field, because there isn't one. If you need to distinguish programmatically at a finer grain, listen on the underlying SignallingClient's Disconnected event and branch on ev.code against WsCloseCode.
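Inspecting the cause could look something like the sketch below. It assumes the cause tokens listed above appear verbatim as substrings of str(ev.err), which this guide implies but does not spell out; classify_fatal and the action labels it returns are hypothetical names, not part of the SDK.

```python
# Cause tokens as listed in this guide; the action labels are hypothetical.
_ACTIONS = {
    "invalid_token": "reauthenticate",
    "token_expired": "reauthenticate",
    "channel_not_authorized": "fix_jwt_claims",
    "account_suspended": "billing",
    "admin_disconnect": "kicked",
}


def classify_fatal(message: str) -> str:
    """Map a FatalError message string to a coarse action label."""
    lowered = message.lower()
    for token, action in _ACTIONS.items():
        if token in lowered:
            return action
    # Anything else, e.g. a token-provider-stuck message whose exact
    # wording is not assumed here.
    return "unknown"
```

In a FatalError handler this would be called as classify_fatal(str(ev.err)). Substring matching is fragile by nature, so treat the "unknown" branch as a real outcome and log the full message alongside it.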
If you use SignallingClient directly, listen on the lower-level Disconnected event instead; there's no FatalError consolidation at that layer:
from metered_realtime import WsCloseCode, Disconnected
@client.on(Disconnected)
def _(ev: Disconnected) -> None:
    if ev.will_reconnect:
        return  # SDK is handling it
    if ev.code == WsCloseCode.INVALID_TOKEN:
        prompt_login("Your session is invalid. Re-authenticate.")
    elif ev.code == WsCloseCode.CHANNEL_NOT_AUTHORIZED:
        log.error("not authorized for that channel; fix the JWT's channels claim")
    elif ev.code == WsCloseCode.ACCOUNT_SUSPENDED:
        surface_billing_problem()
    elif ev.code == WsCloseCode.ADMIN_DISCONNECT:
        log.error("disconnected by an administrator")
    else:
        # max_attempts exhausted
        log.error("gave up reconnecting (code %s)", ev.code)
Terminal close codes
The SDK does not retry these — it transitions straight to "closed" and (on MeteredPeer) emits a FatalError.
| Code | WsCloseCode member | Why it happened | What to do |
|---|---|---|---|
| 4001 | INVALID_TOKEN | JWT signature / kid / format wrong | Re-mint with the correct key |
| 4003 | CHANNEL_NOT_AUTHORIZED | Channel not in the JWT's channels claim | Mint a JWT that includes the channel |
| 4012 | ACCOUNT_SUSPENDED | Account-level kill switch (unpaid, manual) | Direct the user/operator to billing |
| 4020 | ADMIN_DISCONNECT | Disconnected via the REST API | The peer was kicked — surface it / stop the worker |
The terminal set is exactly {4001, 4003, 4012, 4020}. Two related codes are not terminal: 4002 (TOKEN_EXPIRED) reconnects after re-invoking token_provider() for a fresh JWT, and 4010 (OVER_CONCURRENT_LIMIT) reconnects with a ≥30 s backoff floor. See Errors & Codes for the full table.
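For logging or metrics it can help to restate that classification as data. The sketch below only mirrors the codes given in this section; Retry and retry_policy are hypothetical names, and the SDK's own will_reconnect flag remains the authoritative signal at runtime.

```python
from enum import Enum


class Retry(Enum):
    NEVER = "never"                  # terminal: FatalError on MeteredPeer
    REFRESH_TOKEN = "refresh_token"  # 4002: reconnect after token_provider()
    BACKOFF_FLOOR = "backoff_floor"  # 4010: reconnect with a delay floor of 30 s
    NORMAL = "normal"                # anything else: ordinary backoff


TERMINAL_CODES = frozenset({4001, 4003, 4012, 4020})


def retry_policy(code: int) -> Retry:
    """Classify a WebSocket close code the way this guide describes it."""
    if code in TERMINAL_CODES:
        return Retry.NEVER
    if code == 4002:
        return Retry.REFRESH_TOKEN
    if code == 4010:
        return Retry.BACKOFF_FLOOR
    return Retry.NORMAL
```

Tagging a reconnect metric with retry_policy(ev.code).value makes it easy to separate routine 1006 churn from the codes that need a human.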
Per-scenario playbook
What actually happens, step by step, in the most common failure modes:
TCP connection reset, reconnects within seconds
- The OS-level connection dies. The WS surfaces a close (code 1006).
- SDK fires Disconnected(code=1006, will_reconnect=True) (on SignallingClient).
- MeteredPeer transitions to state == "reconnecting"; each survivor RemotePeer goes to "reconnecting" too (without losing identity).
- After ~0.5 s backoff, the SDK opens a new WS.
- The new WS receives the welcome; Connected(is_reconnect=True) fires.
- Auto-resubscribe re-issues the subscribe for the channel.
- Channel reconcile: the post-reconnect presence snapshot identifies survivors / leavers / newcomers. Survivors' connections are silently swapped; their metadata refreshes.
- MeteredPeer transitions back to "joined"; each survivor goes back to "connected".
Your code sees: StateChange → "reconnecting" then StateChange → "joined" on the peer, and per-RemotePeer StateChange → "reconnecting" then "connected".
NAT rebind / interface change
Same as a TCP reset, but the per-peer ICE state often goes "disconnected" before the WS itself notices. The per-peer ICE recovery (Layer 2) kicks in:
- A peer's underlying connection goes "disconnected" / "failed".
- That RemotePeer fires StateChange → "reconnecting".
- The SDK restarts ICE for that peer behind the scenes.
- Either ICE recovers (the peer goes back to "connected") or, if the WS itself drops, Layer 3 reconcile replaces the connection instead.
If the WS stays up, only that peer's state flickers; other peers are untouched. The intermediate "connecting" hop of a rebuilt connection is filtered, so you observe connected → reconnecting → connected.
Event loop starvation (an app bug worth knowing about)
The signalling client runs an inactivity watchdog (default inactivity_timeout=60.0 s). If your code blocks the asyncio event loop — a synchronous network call, a CPU-bound loop, a blocking library call not run in an executor — the SDK's reader can't service inbound frames, and the watchdog eventually closes the stuck socket (code 4000, CLIENT_INACTIVITY) and reconnects.
- The event loop is blocked; inbound frames (including server pings) aren't processed.
- The watchdog fires after inactivity_timeout, closes the socket with 4000, and reconnects.
- Normal reconnect flow.
The real fix is to not block the loop — wrap blocking work in loop.run_in_executor(...) or asyncio.to_thread(...). The watchdog is a self-healing backstop, not a substitute for non-blocking code. (4000 is the SDK's own code; the server never emits it.)
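A self-contained illustration of the fix, using only the standard library (asyncio.to_thread requires Python 3.9 or later). The heartbeat task here is a stand-in for the SDK's reader, and blocking_work is a placeholder for whatever synchronous call your code makes; neither is SDK code.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    """Stand-in for a synchronous call (blocking HTTP client, CPU-bound loop, ...)."""
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks: list, interval: float) -> None:
    """Stand-in for the SDK's reader: it only makes progress if the loop is free."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def main() -> tuple:
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks, 0.05))
    # Runs in a worker thread, so the loop keeps servicing the heartbeat.
    result = await asyncio.to_thread(blocking_work, 0.3)
    hb.cancel()
    return result, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

If blocking_work were called directly inside main(), the heartbeat would record a single tick at most, which is the same starvation the inactivity watchdog exists to catch.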
Server deploy (rolling restart)
- The server emits a going-away frame carrying a retry_after_ms hint.
- SDK fires the GoingAway event with retry_after_ms (informational).
- The server closes with code 1001.
- SDK fires Disconnected(code=1001, will_reconnect=True).
- SDK reconnects on its normal exponential backoff.
- A new server instance accepts the connection. Welcome arrives.
Note: the SDK does not automatically delay by retry_after_ms — it uses its normal backoff curve. The hint exists so a fleet of peers can opt in to spreading reconnects across a deploy window. A long-running fleet of server-side peers that must coordinate with a deploy may want to honor it by listening for GoingAway and gating its own delay; for most apps the normal jittered backoff already spreads reconnects adequately.
JWT expires mid-session
- The server sends close code 4002 (TOKEN_EXPIRED).
- SDK fires Disconnected(code=4002, will_reconnect=True).
- SDK calls your token_provider() for a fresh JWT.
- Reconnect.
For this to work, token_provider() must return a fresh JWT every call — don't cache a token until it expires. Either mint a new one on every call, or cache with a tighter TTL than the JWT's exp (e.g. mint with 1 h expiry, cache for 50 min). You can predict the expiry from Connected.expires_at.
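The cache-with-a-tighter-TTL option can be sketched as follows. CachedTokenProvider is a hypothetical helper, mint stands for your own function that obtains a JWT from your backend, and the sketch assumes a synchronous token_provider; if yours is a coroutine, the same logic applies with async/await.

```python
import time
from typing import Callable, Optional


class CachedTokenProvider:
    """Cache a minted JWT for less time than its lifetime (hypothetical helper)."""

    def __init__(self, mint: Callable[[], str], cache_ttl: float = 50 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._mint = mint            # your own minting function
        self._cache_ttl = cache_ttl  # seconds; keep this below the JWT's exp
        self._clock = clock
        self._token: Optional[str] = None
        self._minted_at = 0.0

    def __call__(self) -> str:
        now = self._clock()
        # Re-mint on first use or once the cached token has aged out.
        if self._token is None or now - self._minted_at >= self._cache_ttl:
            self._token = self._mint()
            self._minted_at = now
        return self._token
```

With a 1 h expiry and the default 50 min cache TTL, any token handed out has at least ten minutes of validity left, so a 4002 close is always followed by a usable token rather than the one that just expired.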
Concurrent-connection cap hit
- SDK tries to reconnect; the server rejects with 4010.
- SDK enforces a ≥30 s backoff floor on the next attempt.
- Either the cap clears (another connection freed up) or the SDK gives up after max_attempts.
This usually means a connection leak somewhere — old MeteredPeer / SignallingClient instances you forgot to close(). Always await peer.close() (or use async with) so the TCP FIN flushes and the server records a clean disconnect promptly. The dashboard's "Current concurrent connections" graph shows you the number.
Tuning ReconnectOptions
from metered_realtime import ReconnectOptions
ReconnectOptions(
    initial_delay=0.5,   # seconds; first backoff
    multiplier=2.0,      # doubles each attempt
    max_delay=30.0,      # seconds; backoff cap
    jitter_ratio=0.2,    # ±20% spread, so a fleet doesn't reconnect in lockstep
    max_attempts=100,    # float("inf") for daemons
)
Backoff is min(initial_delay * multiplier ** (attempt - 1), max_delay), spread by ±jitter_ratio of that value (±20% by default); each successful welcome resets the attempt counter. The constructor validates these: initial_delay <= 0, multiplier < 1, max_delay < initial_delay, jitter_ratio outside [0, 1], or a negative max_attempts each raise ValueError, so a zero-delay reconnect loop is impossible to configure by accident. max_attempts=float("inf") is allowed.
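To see what a given configuration means in wall-clock terms, the formula can be evaluated directly. This is a standalone sketch, not the SDK's implementation: base_delay and jittered_delay are hypothetical names, and the uniform symmetric jitter is an assumption about how the spread is drawn.

```python
import random


def base_delay(attempt: int, initial_delay: float = 0.5, multiplier: float = 2.0,
               max_delay: float = 30.0) -> float:
    """Backoff before jitter for a 1-based attempt number."""
    return min(initial_delay * multiplier ** (attempt - 1), max_delay)


def jittered_delay(attempt: int, jitter_ratio: float = 0.2, rng=random, **kw) -> float:
    """Apply a symmetric spread; the exact jitter distribution is an assumption."""
    base = base_delay(attempt, **kw)
    return base * (1.0 + rng.uniform(-jitter_ratio, jitter_ratio))


# First seven pre-jitter delays with the defaults.
schedule = [base_delay(n) for n in range(1, 8)]
# Total pre-jitter backoff across the default 100 attempts.
total_100 = sum(base_delay(n) for n in range(1, 101))
```

With the defaults, schedule is [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0] and total_100 is 2851.5 seconds, about 47.5 minutes, not counting the time each connection attempt itself takes. Lowering max_delay to 15.0 roughly halves that budget for the same max_attempts.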
Pass it as the reconnect option (or reconnect=False to disable reconnect entirely — rarely what you want):
peer = MeteredPeer(
    api_key="pk_live_…",
    reconnect=ReconnectOptions(max_attempts=float("inf"), max_delay=15.0),
)
Leave auto_resubscribe=True on MeteredPeer. Its channel recovery depends on the post-reconnect re-subscribe; the opt-out exists on SignallingClient for manual subscription control. Turning it off on a SignallingClient and forgetting to re-subscribe means you silently miss every message after the first reconnect.
Testing your reconnect logic
You can't validate reconnect behaviour by just running the happy path. The failure modes worth testing before production:
| Test | How to simulate | What to verify |
|---|---|---|
| Brief network drop | Block the host's egress to rms.metered.ca for ~5 s (firewall rule, iptables, pull the interface) | StateChange → "reconnecting" then "joined"; buffered app sends flush after recovery |
| Long outage | Block egress for 60 s | Stays in "reconnecting"; recovers cleanly when egress restored (with max_attempts high enough) |
| Server-initiated kick | DELETE the peer via the REST API from another process | Close code 4020 surfaces; no retry; FatalError fires |
| JWT expiry | Mint a JWT with exp ~60 s out and wait | Code 4002 disconnect, token_provider() called, reconnect succeeds with the fresh token |
| Token revocation | Revoke the key in the dashboard while connected | Reconnect attempt fails with 4001; FatalError fires |
| Data channel reopen | If using P2P channels: force a reconnect, verify channel traffic resumes | Per-peer StateChange → "connected" fires, opens a fresh channel, data resumes |
| Concurrent cap | Open more connections than your plan allows | Code 4010, ≥30 s backoff floor, eventually gives up or one frees |
| Event-loop starvation | time.sleep(90) on the loop (don't do this in prod) | Watchdog fires (code 4000) after inactivity_timeout; reconnects |
Common pitfalls
1. Alerting on every "reconnecting". A reconnect that succeeds within the 0.5 s initial backoff is normal. Alert on dwell time, not on the transition (see pattern 1).
2. A P2P data channel silently stops carrying data after a reconnect. Caused by holding the channel across the drop. Use the per-peer StateChange → "connected" pattern from pattern 2.
3. Stale remote.pc references. Same root cause as #2: remote.pc is a different object after a reconcile. Either re-read it every time you need it, or wire to StateChange → "connected". (See RemotePeer → the pc escape hatch.)
4. Caching a MediaStream from StreamAdded. The object identity changes on reconnect (the stream.id is stable, but it's a fresh object). Re-take it on each StreamAdded, keyed by stream.id. StreamRemoved is suppressed during a reconnect, so it won't look like the peer left.
5. token_provider() returning a stale (cached) JWT. If your mint caches and you refresh by reconnecting, you'll cycle: 4002 close → token_provider → stale token → 4002 again. Make sure mint returns a fresh token (see the JWT-expiry playbook).
6. Calling peer.send(...) during "reconnecting". Raises MeteredPeerSendError with code == "not_joined". Either gate sends on peer.state == "joined", or queue them yourself and flush on StateChange → "joined".
7. Trying to rejoin with the same MeteredPeer after close() (or after it reached "closed"). Terminal. Construct a fresh instance; see pattern 3.
8. Not handling state == "closed". On a MeteredPeer that's reached max_attempts, this is the SDK saying "I gave up." If your code just awaits the next StateChange forever, you're stuck. Supervise and rebuild (pattern 3), or run with max_attempts=float("inf").
9. Blocking the asyncio event loop. A synchronous call on the loop stalls the SDK's reader; the inactivity watchdog will close-and-reconnect (code 4000), but you've also stalled everything else. Run blocking work in an executor / thread.
10. auto_resubscribe=False on a SignallingClient without re-subscribing. Default is True; leave it on unless you have a specific scoped-subscription pattern. Off + no manual re-subscribe means you silently miss messages on every subsequent connection. (MeteredPeer's channel recovery requires it on.)
See also
- MeteredPeer: StateChange, the state machine, FatalError, what survives a reconnect
- RemotePeer: what survives reconcile, what doesn't, the pc escape hatch
- SignallingClient: Disconnected, GoingAway, TokenProviderError, ReconnectOptions
- Errors & Codes: every close code + recommended response
- DataChannel: the backpressure-aware P2P wrapper and reconnect-aware reopen
- Authentication: token_provider minting + refresh on reconnect