Reconnect Best Practices
Required reading before production. Real users put their laptops to sleep, ride elevators, switch from Wi-Fi to cellular, lose tabs to memory pressure, and run into deploys. Apps that haven't thought about reconnect look broken; apps that have look magic.
This guide covers: what failure modes happen in production, what the SDK handles automatically, what your code must own, the standard UI patterns, the gotchas that bite every developer once, and how to test the whole thing.
TL;DR — the five things you need to do
- Listen for state-change and surface a "reconnecting…" banner when state === "reconnecting". (pattern below)
- If you use the remote.pc.createDataChannel(...) escape hatch, re-open the DC on each state-change → "connected" event. (pattern below)
- Set a stuck-reconnect timeout: after 30 s of "reconnecting", surface a "still trying… retry now?" button. (pattern below)
- Listen for token-provider-error and surface a "please log in again" prompt when your tokenProvider() keeps failing. (pattern below)
- Handle terminal close codes explicitly (4001 / 4003 / 4012 / 4020) and show the user what action they need to take. (table below)
The rest of this page expands each of these.
What can go wrong in production
| Scenario | Frequency | Detected as |
|---|---|---|
| Wi-Fi drops, immediately reconnects | Common | Brief reconnecting → connected, ~1–2 s |
| Wi-Fi → cellular roam | Common on mobile | reconnecting → connected, ~3–8 s |
| Laptop sleep + wake | Common | reconnecting → connected on wake |
| Server deploy / rolling restart | Daily-to-weekly | disconnected with code 1001, then auto-reconnect |
| TURN failover (one TURN node dies) | Rare | Per-peer reconnecting (ICE-restart ladder) |
| Tab backgrounded (Chrome throttles) | Common on mobile | reconnecting after the throttle window expires |
| Network captive portal / coffee shop login | Common | Long reconnecting until user dismisses portal |
| JWT expires mid-session | Predictable from expiresAt | 4002 close + auto-refresh via tokenProvider |
| Plan concurrent-cap hit (after reconnect retry) | Rare | 4010 close, ≥30 s backoff |
| User suspended for unpaid balance | Action-required | 4012 close, no retry |
| Admin kicked the peer | Action-required | 4020 close, no retry |
| Token leaked + revoked | Action-required | 4001 close on next reconnect |
What the SDK handles for you
The SDK has three independent reconnect layers that cooperate. Your code rarely needs to know about the layers individually: they all surface as the same state === "reconnecting", and almost every recovery happens without your intervention. The layers are described below for completeness:
Layer 1 — Signalling WebSocket auto-reconnect
The most common case. The WS drops; the SDK opens a new one with exponential backoff + jitter.
- Backoff: starts at 500 ms, doubles each attempt, caps at 30 s, ±20% jitter.
- Token refresh: your tokenProvider() is re-called on every reconnect, so a refreshed JWT (with new TURN creds, new permissions, new expiry) lands automatically.
- Auto-resubscribe: every channel that was subscribed before the drop is re-subscribed after the new welcome (assuming autoResubscribe: true, which is the default).
- Close-code-aware: terminal codes skip retry; rate-limit codes slow-backoff. See the terminal codes table below.
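The backoff numbers above can be sketched as a small function. This is an illustration of capped exponential backoff with ±20% jitter, not the SDK's actual implementation: the name backoffDelayMs, the 0-based attempt numbering, and applying jitter after the cap are all assumptions.

```javascript
// Illustrative only: the SDK's real formula may differ.
const BASE_MS = 500;   // first retry delay
const CAP_MS = 30_000; // upper bound before jitter is applied
const JITTER = 0.2;    // ±20%

// attempt is 0-based; random is injectable so the function can be tested.
function backoffDelayMs(attempt, random = Math.random) {
  const capped = Math.min(BASE_MS * 2 ** attempt, CAP_MS);
  // Map random() in [0, 1) to a multiplier in [0.8, 1.2).
  const factor = 1 - JITTER + random() * 2 * JITTER;
  return Math.round(capped * factor);
}
```

Under these assumptions the cap is reached on the seventh attempt (500 ms × 2^6 = 32 s, clamped to 30 s), so a short outage is retried quickly while a long one settles at roughly one attempt every 30 s.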
Layer 2 — ICE-restart ladder (per peer)
If a single peer's RTCPeerConnection goes to "disconnected" or "failed" (Wi-Fi→cellular roam, TURN node failed), the SDK runs an ICE-restart ladder: 9 attempts over ~121 s total.
- Surfaces as that peer's state === "reconnecting" while it runs.
- Other peers in the channel are unaffected.
- Happens entirely inside the SDK; your code just waits.
Layer 3 — Channel-level reconcile
When the signalling WS drops, every peer in your channel is technically orphaned (their PCs are still up, but the SDK can't route SDP between you). On reconnect, the SDK performs a reconcile:
- The RemotePeer instances in peer.remotePeers are preserved across the drop. Same object identity (=== returns true) and same id.
- Each survivor's underlying RTCPeerConnection is silently replaced with a fresh one using the latest TURN credentials. Old PC is closed.
- Peers that genuinely left during the drop fire peer-left. Peers that joined fire peer-joined.
- Local streams added via peer.addStream(...) automatically reattach to each survivor's new PC, with no renegotiation cycles in your code.
The reconcile is the most surprising layer. The key consequence is: your RemotePeer references survive, but raw RTCPeerConnection / RTCDataChannel handles you held don't. Pattern is below.
What your code must own
The SDK can't decide what your UI looks like. You own:
- The "reconnecting…" UI — when to show it, when to hide it, what it says.
- Re-opening any P2P DataChannels you opened via the remote.pc.createDataChannel(...) escape hatch.
- Handling stuck reconnects: when the SDK has been retrying for 30+ seconds and nothing's working, what do you tell the user?
- Handling token-provider-error: if their auth has expired, prompt them to log in again.
- Handling terminal close codes: show the right UI for each (suspended, kicked, unauthorized).
The five patterns
1. Reconnecting banner
Most apps want a banner that appears when reconnect starts and disappears when it succeeds. The state-change event drives this.
let reconnectingBanner = null;
peer.on("state-change", ({ to }) => {
if (to === "reconnecting") {
reconnectingBanner = showBanner("Reconnecting…");
} else if (reconnectingBanner) {
reconnectingBanner.hide();
reconnectingBanner = null;
}
});
Tip — don't show the banner for the first attempt. If reconnect succeeds within 500 ms (the initial backoff), the banner flickers in and out so fast it's jarring. Delay showing by ~1 s:
let bannerTimer = null;
peer.on("state-change", ({ to }) => {
  // Cancel any pending timer first so repeated events never stack timers.
  if (bannerTimer) clearTimeout(bannerTimer);
  bannerTimer = null;
  if (to === "reconnecting") {
    bannerTimer = setTimeout(() => showBanner("Reconnecting…"), 1000);
  } else {
    hideBanner();
  }
});
2. Reopen DataChannels after reconcile
If — and only if — you use the remote.pc.createDataChannel(...) escape hatch for P2P data, the channel handle is tied to the old RTCPeerConnection and won't survive a reconcile.
The pattern: don't open the DataChannel imperatively. Wire it to state-change → "connected", which fires once on initial connect AND once per reconcile cycle:
const channels = new Map(); // peerId → currently-open DataChannel
peer.on("peer-joined", ({ peer: remote }) => {
remote.on("state-change", ({ to }) => {
if (to !== "connected") return;
// Drop the previous DC (closed by the PC swap, but we clean up our map).
channels.get(remote.id)?.close();
const dc = remote.pc.createDataChannel("game-state", { ordered: false });
dc.onmessage = (ev) => handleGameTick(JSON.parse(ev.data));
dc.onopen = () => console.log("DC open to", remote.id);
channels.set(remote.id, dc);
});
});
peer.on("peer-left", ({ peer: remote }) => {
channels.get(remote.id)?.close();
channels.delete(remote.id);
});
If you're not using DataChannels — if all your inter-peer data goes through peer.send(data) — you don't need this. peer.send is server-routed and survives reconnect automatically.
See Data Channels & Low Latency for the full P2P pattern.
3. Stuck in "reconnecting"
The SDK retries for maxAttempts (default 100) before giving up. With 30-second backoffs at the cap, that's a lot of wall-clock time. If something is genuinely broken (server outage, customer's WAF blocking WebSockets, etc.), the user should see a "retry now?" button rather than staring at "reconnecting…" forever.
let stuckTimer = null;
peer.on("state-change", ({ to }) => {
  // Cancel any earlier timer so repeated events never stack timers.
  if (stuckTimer) clearTimeout(stuckTimer);
  stuckTimer = null;
  if (to === "reconnecting") {
    stuckTimer = setTimeout(() => {
      if (peer.state === "reconnecting") {
        showStuckBanner({
          message: "Still trying to reconnect…",
          retryButton: async () => {
            await peer.close();
            // peer must be declared with let, and your event handlers
            // must be re-attached to the new instance.
            peer = new MeteredPeer(opts);
            await peer.join(channel);
          },
        });
      }
    }, 30_000); // 30 s is a reasonable default; tune to your app
  } else {
    hideStuckBanner();
  }
});
Why close() + new instance? peer.close() is terminal — the existing instance can't rejoin. Constructing a fresh MeteredPeer resets all internal state (backoff counter, cached welcome, subscribe set) and tries from scratch.
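To put a rough number on the wall-clock time mentioned above, the sketch below sums the backoff delays for the default maxAttempts. It assumes the Layer 1 parameters (500 ms start, doubling, 30 s cap), ignores jitter, and assumes one backoff delay before each attempt; totalBackoffMs is a hypothetical helper, not an SDK function.

```javascript
// Assumptions: base 500 ms, doubling, 30 s cap, jitter ignored,
// and one backoff delay before each of maxAttempts attempts.
function totalBackoffMs(maxAttempts, baseMs = 500, capMs = 30_000) {
  let total = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    total += Math.min(baseMs * 2 ** attempt, capMs);
  }
  return total;
}

const minutes = totalBackoffMs(100) / 60_000;
console.log(minutes.toFixed(1)); // prints "47.5"
```

Under these assumptions the SDK could keep retrying for about 47 minutes before giving up, which is why a 30 s stuck-reconnect timer is worth wiring up.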
4. tokenProvider failures
If your tokenProvider() keeps throwing — your backend is down, the user's session expired, a CDN cache is serving an old API — the SDK fires token-provider-error after 3 consecutive failures. This is informational; the SDK keeps retrying. Surface a "please log in again" prompt if appropriate.
client.on("token-provider-error", ({ consecutiveFailures, err }) => {
console.warn(`tokenProvider failed ${consecutiveFailures}x:`, err);
if (consecutiveFailures >= 5) {
showAuthErrorBanner("Your session may have expired. Log in again to continue.");
}
});
Don't disconnect the client yourself — that throws away the existing connection (which may still be working for now, with the old token, before its expiry). Let the SDK keep trying. The banner is purely a hint to the user.
5. Terminal close codes
The SDK won't retry these. MeteredPeer customers can wire a single error handler — the SDK forwards terminal close codes (4001/4002/4003/4012/4020), fatal server-error frames, and tokenProvider failures past the retry threshold all through the same event, with err.name carrying a symbolic code:
peer.on("error", ({ err }) => {
switch (err.name) {
case "invalid_token":
case "token_expired":
showLoginPrompt("Your session is invalid. Log in again.");
break;
case "channel_not_authorized":
showError("You don't have access to that channel.");
break;
case "account_suspended":
showAccountSuspendedUI();
break;
case "admin_disconnect":
showKickedUI("You were disconnected by an administrator.");
break;
case "TokenProviderError":
showAuthFlowBroken("Your auth pipeline keeps rejecting; check your token endpoint.");
break;
default:
reportToSentry(err);
}
});
SignallingClient customers wire on the lower-level disconnected event instead — there's no error consolidation at that layer:
import { WsCloseCode } from "@metered-ca/peer";
client.on("disconnected", ({ code, willReconnect }) => {
if (willReconnect) return; // SDK is handling it
switch (code) {
case WsCloseCode.InvalidToken:
showLoginPrompt("Your session is invalid. Log in again.");
break;
case WsCloseCode.ChannelNotAuthorized:
showError("You don't have access to that channel.");
break;
case WsCloseCode.AccountSuspended:
showAccountSuspendedUI();
break;
case WsCloseCode.AdminDisconnect:
showKickedUI("You were disconnected by an administrator.");
break;
default:
// maxAttempts exhausted
showGenericError("Could not reconnect. Refresh the page to try again.");
}
});
Terminal close codes
| Code | Constant | Why it happened | What to do |
|---|---|---|---|
| 4001 | InvalidToken | JWT signature / kid / format wrong | Re-mint with the correct key |
| 4003 | ChannelNotAuthorized | Channel not in JWT's channels claim | Mint a new JWT that includes the channel |
| 4012 | AccountSuspended | Customer-level kill switch (unpaid, manual) | Direct user to billing |
| 4020 | AdminDisconnect | DELETE /v1/peers/:id from REST API | Tell the user they were kicked |
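For logging or tests it can help to collapse the close codes mentioned on this page into a retry category. The sketch below is one way to do that; classifyCloseCode and its category names are hypothetical, and in application code the willReconnect flag from the SDK remains the source of truth.

```javascript
// Codes from the terminal close codes table: the SDK does not retry these.
const TERMINAL_CODES = new Set([4001, 4003, 4012, 4020]);

// Hypothetical helper: maps a WebSocket close code to the retry
// behaviour this page describes for it.
function classifyCloseCode(code) {
  if (TERMINAL_CODES.has(code)) return "terminal"; // user action needed
  if (code === 4002) return "refresh-token";       // tokenProvider() is called, then reconnect
  if (code === 4010) return "slow-backoff";        // backoff floor of at least 30 s
  return "normal-backoff";                         // e.g. 1001 or 1006
}
```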
Per-scenario playbook
What actually happens, step by step, in the most common failure modes:
Wi-Fi drops, reconnects within seconds
- The OS-level TCP connection times out. The WS fires close with code 1006.
- SDK fires disconnected({ code: 1006, willReconnect: true }).
- SDK fires state-change → "reconnecting".
- After ~500 ms backoff, SDK opens a new WS.
- The new WS receives welcome. SDK fires connected({ isReconnect: true }).
- Auto-resubscribe issues a subscribe for each previously-subscribed channel.
- (For MeteredPeer) Channel reconcile: presence diff identifies survivors / leavers / newcomers. Survivors' PCs are silently swapped.
- SDK fires state-change → "connected" (or "joined" for MeteredPeer).
Your code sees: state-change → "reconnecting" then state-change → "connected". Optional banner shown / hidden if you wired one up.
Wi-Fi → cellular roam (mobile)
Same as Wi-Fi drop, but with one extra wrinkle for MeteredPeer: the per-peer ICE state often goes "disconnected" before the WS itself notices. The per-peer ICE-restart ladder kicks in:
- RTCPeerConnection.iceConnectionState === "disconnected" for peer X.
- SDK fires remote.state-change → "reconnecting" for peer X.
- SDK calls pc.restartIce() with exponential backoff, up to 9 attempts.
- Either ICE recovers (peer X back to "connected") or, if the WS itself drops, Layer 3 reconcile kicks in instead.
If the WS stays up, only that peer's state flickers; other peers are untouched.
Tab backgrounded (Chrome / mobile Safari throttling)
Backgrounded tabs get JS timers throttled aggressively. The SDK's inactivity watchdog (default 60 s) catches stuck connections:
- Tab is backgrounded; the server keeps sending pings, but JS doesn't run to respond.
- SDK's setTimeout-based watchdog eventually fires (slowly), or the server times out and closes (code 1001 or 1006).
- When the user foregrounds the tab, the WS is dead. SDK opens a new one.
- Normal reconnect flow.
For most apps, this is fine: the reconnect happens within ~1 s of foregrounding. If you need faster recovery, listen for the visibilitychange event on document and force a reconnect proactively:
document.addEventListener("visibilitychange", () => {
  if (!document.hidden && peer.state === "reconnecting") {
    // Sometimes the throttled WS isn't actually dead yet, but we want
    // to move things along. retryNow() is your own helper: the same
    // close() + fresh MeteredPeer + join() sequence as the retry button in pattern 3.
    void retryNow();
  }
});
(Most apps don't need this. The default behaviour works.)
Server deploy (rolling restart)
- Server emits a going_away frame carrying a retryAfterMs hint.
- SDK fires going-away event with { retryAfterMs } (informational).
- Server closes with code 1001.
- SDK fires disconnected({ code: 1001, willReconnect: true }).
- SDK reconnects on its normal exponential backoff.
- A new server instance accepts the connection. Welcome arrives.
Note: The current SDK does NOT automatically delay by retryAfterMs — it uses its normal backoff curve. The retryAfterMs hint exists so customers can opt in to spreading reconnects across a fleet during a deploy. If you want to honor it, listen for going-away yourself and stall your own retry:
let manualDelayMs = 0;
peer.on("going-away", ({ retryAfterMs }) => {
manualDelayMs = retryAfterMs;
});
peer.on("disconnected", async ({ willReconnect, code }) => {
if (willReconnect && manualDelayMs > 0) {
// Disable auto-reconnect to take control, sleep, then reconnect.
// (Most apps don't bother — normal backoff is fine for browser clients.)
}
});
For browser-side apps this rarely matters — the normal backoff jitter already spreads reconnects across the fleet adequately. Long-running server-side processes that must coordinate with a deploy window may want to honor the hint.
JWT expires mid-session
- Server sends close code 4002 (TokenExpired).
- SDK fires disconnected({ code: 4002, willReconnect: true }).
- SDK calls your tokenProvider() to get a fresh JWT.
- Reconnect.
For this to work, tokenProvider() must return a fresh JWT every call — don't cache a token until it expires. Either mint a new one on every call, or cache with a tighter TTL than the JWT's exp (e.g. mint with 1 h expiry, cache for 50 minutes).
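The "cache with a tighter TTL" option could look like the sketch below. It assumes the SDK accepts an async tokenProvider; createCachedTokenProvider and mintToken are hypothetical names, with mintToken standing in for whatever call fetches a fresh JWT from your backend.

```javascript
// mintToken is a hypothetical async function that calls your backend
// and resolves to a freshly minted JWT string.
// ttlMs defaults to 50 minutes, for JWTs minted with a 1 h expiry.
function createCachedTokenProvider(mintToken, ttlMs = 50 * 60_000, now = Date.now) {
  let cached = null; // last minted token, or null before the first mint
  let mintedAt = 0;  // timestamp of the last successful mint

  return async function tokenProvider() {
    if (cached === null || now() - mintedAt >= ttlMs) {
      cached = await mintToken();
      mintedAt = now();
    }
    return cached;
  };
}
```

Because the cache TTL is shorter than the JWT lifetime, a 4002 close always finds the cache already expired, so the next tokenProvider() call mints a fresh token instead of returning the stale one.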
Concurrent-connection cap hit
- SDK tries to reconnect; server rejects with 4010.
- SDK enforces a ≥30 s backoff floor.
- Either the cap clears (another connection freed up) or the SDK gives up after maxAttempts.
This usually means a connection leak somewhere — old MeteredPeer instances you forgot to close(). Check your state === "closed" cleanup. The dashboard's "Current concurrent connections" graph shows you the number.
Testing your reconnect logic
You can't validate reconnect behaviour by just running the happy path. The following table covers the failure modes worth testing before going to production.
| Test | How to simulate | What to verify |
|---|---|---|
| Brief network drop | DevTools → Network → "Offline" for 5 s | Banner appears, then hides; messages buffered locally arrive after reconnect |
| Long network drop | "Offline" for 60 s | Stuck-reconnect banner appears at ~30 s; recovers cleanly when network restored |
| Server-initiated close | DELETE /v1/peers/:id from another tab using the REST API | Terminal close code 4020 surfaces; no retry; "kicked" UI shown |
| JWT expiry | Mint a JWT with exp 60 s out and wait | Code 4002 disconnect, tokenProvider() called, reconnect succeeds with fresh token |
| Token revocation | Revoke the key from the dashboard while connected | Disconnects + reconnect attempt fails with 4001; "log in again" UI shown |
| DataChannel reopen | If using P2P DCs: force a reconnect, verify DC traffic resumes | Your state-change → "connected" handler fires, opens a fresh DC, data resumes |
| Concurrent cap | Open more connections than your plan allows | Code 4010, slow backoff, eventually gives up or one frees |
| Tab backgrounded | Switch tabs for 5 minutes | On foreground, reconnects within seconds; messages from while-backgrounded delivered |
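The manual tests above exercise the real network. For unit tests of your own handlers, a small test double can drive state transitions without a connection. The sketch below is one possible shape: FakePeer and its transition() method are hypothetical and not part of the SDK, and the from field in the event payload is an assumption (the patterns on this page only rely on to).

```javascript
// FakePeer is a hypothetical test double. It mimics only the
// on(...) / state surface that the patterns on this page rely on.
class FakePeer {
  constructor() {
    this.state = "joined";
    this.handlers = new Map(); // event name → array of listeners
  }

  on(event, handler) {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Test-only helper: move to a new state and notify listeners.
  transition(to) {
    const from = this.state;
    this.state = to;
    for (const handler of this.handlers.get("state-change") ?? []) {
      handler({ from, to });
    }
  }
}

// Example: check that a banner handler shows and then hides.
const fake = new FakePeer();
const log = [];
fake.on("state-change", ({ to }) => {
  log.push(to === "reconnecting" ? "show" : "hide");
});
fake.transition("reconnecting");
fake.transition("joined");
console.log(log.join(",")); // prints "show,hide"
```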
Common pitfalls
1. Reconnect banner that flickers. Caused by not adding a 1 s grace period before showing it (see pattern 1).
2. DataChannel silently stops carrying data after a reconnect. Caused by holding RTCDataChannel references across the drop. Use the state-change → "connected" pattern from pattern 2.
3. Stale remote.pc references. Same root cause as #2: remote.pc is a different object after a reconcile. Either re-read it every time you need it, or wire to state-change.
4. tokenProvider() returning a stale (cached) JWT. If your mint endpoint caches and you refresh by reconnecting, you'll cycle: 4002 close → tokenProvider → stale token → 4002 again. Make sure mint returns fresh.
5. Adding state listeners on the underlying pc directly. Listeners added via remote.pc.addEventListener(...) go away when the PC is swapped on reconcile. Listen on the RemotePeer wrapper (remote.on(...)) instead; those persist.
6. Calling peer.send(...) during "reconnecting". Rejects with MeteredPeerSendError("not_joined") or similar. Either gate sends on state === "joined", or queue them yourself and flush on state-change → "joined".
7. Trying to rejoin with the same MeteredPeer after close(). Terminal. Construct a fresh instance.
8. Not handling state === "closed". This is the SDK telling you "I'm giving up." If your code just waits forever for the next state-change, you're stuck. Show the user a retry button.
9. autoResubscribe: false + forgetting to resubscribe. Default is true; leave it on unless you have a specific reason. Setting it false and not handling the reconnect-resubscribe yourself means you silently miss messages on every subsequent connection.
10. Reconnect on every page load. This is fine in browsers (transient connection per tab is normal), but if you're running a long-lived Node process, set reconnect.maxAttempts: Infinity so transient cloud-provider blips don't take you offline permanently.
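For the peer.send(...) pitfall above, queueing sends yourself could look like the following sketch. It assumes peer.state, peer.on("state-change", ...) and peer.send(data) behave as described on this page; createSendQueue and the maxQueued limit are hypothetical, and handling of errors from peer.send is left out for brevity.

```javascript
// Sketch: buffer outgoing messages while the peer is not joined,
// then flush them in order once it is joined again.
function createSendQueue(peer, maxQueued = 100) {
  const pending = [];

  peer.on("state-change", ({ to }) => {
    if (to !== "joined") return;
    while (pending.length > 0) peer.send(pending.shift());
  });

  return function send(data) {
    if (peer.state === "joined") {
      return peer.send(data);
    }
    // Bound the queue so a long outage cannot grow memory without limit.
    if (pending.length >= maxQueued) pending.shift(); // drop the oldest
    pending.push(data);
  };
}
```

Dropping the oldest message suits state updates where only recent data matters; for chat-style traffic you may prefer to reject new sends once the queue is full.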
See also
- MeteredPeer: state-change event and state transitions
- RemotePeer: what survives reconcile, what doesn't
- Errors & Codes: every close code + recommended response
- Data Channels & Low Latency: full P2P DC pattern including reconnect-aware producer
- Authentication: tokenProvider minting + refresh