Reconnect Best Practices

Required reading before production. Real users put their laptops to sleep, ride elevators, switch from Wi-Fi to cellular, lose tabs to memory pressure, and run into deploys. Apps that haven't thought about reconnect look broken; apps that have look magic.

This guide covers: what failure modes happen in production, what the SDK handles automatically, what your code must own, the standard UI patterns, the gotchas that bite every developer once, and how to test the whole thing.

TL;DR — the five things you need to do

  1. Listen for state-change and surface a "reconnecting…" banner when state === "reconnecting". (pattern below)
  2. If you use the remote.pc.createDataChannel(...) escape hatch, re-open the DC on each state-change → "connected" event. (pattern below)
  3. Set a stuck-reconnect timeout — after 30 s of "reconnecting", surface a "still trying… retry now?" button. (pattern below)
  4. Listen for token-provider-error and surface a "please log in again" prompt when your tokenProvider() keeps failing. (pattern below)
  5. Handle terminal close codes explicitly — 4001 / 4003 / 4012 / 4020 — show the user what action they need to take. (table below)

The rest of this page expands each of these.

What can go wrong in production

| Scenario | Frequency | Detected as |
| --- | --- | --- |
| Wi-Fi drops, immediately reconnects | Common | Brief reconnecting → connected, ~1–2 s |
| Wi-Fi → cellular roam | Common on mobile | reconnecting → connected, ~3–8 s |
| Laptop sleep + wake | Common | reconnecting → connected on wake |
| Server deploy / rolling restart | Daily-to-weekly | disconnected with code 1001, then auto-reconnect |
| TURN failover (one TURN node dies) | Rare | Per-peer reconnecting (ICE-restart ladder) |
| Tab backgrounded (Chrome throttles) | Common on mobile | reconnecting after the throttle window expires |
| Network captive portal / coffee shop login | Common | Long reconnecting until user dismisses portal |
| JWT expires mid-session | Predictable from expiresAt | 4002 close + auto-refresh via tokenProvider |
| Plan concurrent-cap hit (after reconnect retry) | Rare | 4010 close, ≥30 s backoff |
| User suspended for unpaid balance | Action-required | 4012 close, no retry |
| Admin kicked the peer | Action-required | 4020 close, no retry |
| Token leaked + revoked | Action-required | 4001 close on next reconnect |

What the SDK handles for you

The SDK has three independent reconnect layers that cooperate. Your code rarely needs to know about the layers individually: they all surface as the same state === "reconnecting", and almost every recovery happens without your intervention. They are described here for completeness:

Layer 1 — Signalling WebSocket auto-reconnect

The most common case. The WS drops; the SDK opens a new one with exponential backoff + jitter.

  • Backoff: starts at 500 ms, doubles each attempt, caps at 30 s, ±20% jitter.
  • Token refresh: your tokenProvider() is re-called on every reconnect, so a refreshed JWT (with new TURN creds, new permissions, new expiry) lands automatically.
  • Auto-resubscribe: every channel that was subscribed before the drop is re-subscribed after the new welcome (assuming autoResubscribe: true, which is the default).
  • Close-code-aware: terminal codes skip retry; rate-limit codes slow-backoff. See the terminal codes table below.
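
The backoff bullet above can be made concrete with a small sketch. This is not the SDK's internal code: the function names are hypothetical, and it assumes that the first retry waits 500 ms and that jitter is applied after the 30 s cap.

```javascript
// Sketch of the documented curve: 500 ms base, doubling, 30 s cap, ±20% jitter.
// Names are illustrative only; the SDK's internals may differ.
const BASE_MS = 500;
const CAP_MS = 30_000;
const JITTER = 0.2;

// Delay before retry number `attempt` (0-based), without jitter.
function nominalDelayMs(attempt) {
  return Math.min(BASE_MS * 2 ** attempt, CAP_MS);
}

// Same delay with ±20% jitter. `random` is injectable so tests can be deterministic.
function jitteredDelayMs(attempt, random = Math.random) {
  const factor = 1 + JITTER * (2 * random() - 1); // in [0.8, 1.2)
  return Math.round(nominalDelayMs(attempt) * factor);
}

// Total nominal wait across the first `attempts` retries.
function totalNominalMs(attempts) {
  let total = 0;
  for (let i = 0; i < attempts; i++) total += nominalDelayMs(i);
  return total;
}

console.log([0, 1, 2, 3, 4, 5, 6].map(nominalDelayMs));
// [500, 1000, 2000, 4000, 8000, 16000, 30000]
```

Under these assumptions the cap is reached on the seventh retry, and 100 retries add up to about 47.5 minutes of nominal waiting, not counting the time each connection attempt itself takes.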

Layer 2 — ICE-restart ladder (per peer)

If a single peer's RTCPeerConnection goes to "disconnected" or "failed" (Wi-Fi→cellular roam, TURN node failed), the SDK runs an ICE-restart ladder: 9 attempts over ~121 s total.

  • Surfaces as that peer's state === "reconnecting" while it runs.
  • Other peers in the channel are unaffected.
  • Happens entirely inside the SDK — your code just waits.

Layer 3 — Channel-level reconcile

When the signalling WS drops, every peer in your channel is technically orphaned (their PCs are still up, but the SDK can't route SDP between you). On reconnect, the SDK performs a reconcile:

  • The RemotePeer instances in peer.remotePeers are preserved across the drop. Same object identity (=== returns true) and same id.
  • Each survivor's underlying RTCPeerConnection is silently replaced with a fresh one using the latest TURN credentials. Old PC is closed.
  • Peers that genuinely left during the drop fire peer-left. Peers that joined fire peer-joined.
  • Local streams added via peer.addStream(...) automatically reattach to each survivor's new PC — no renegotiation cycles in your code.

The reconcile is the most surprising layer. The key consequence: your RemotePeer references survive, but any raw RTCPeerConnection / RTCDataChannel handles you held do not. The pattern for handling this is in "Reopen DataChannels after reconcile" below.

What your code must own

The SDK can't decide what your UI looks like. You own:

  1. The "reconnecting…" UI — when to show it, when to hide it, what it says.
  2. Re-opening any P2P DataChannels you opened via the remote.pc.createDataChannel(...) escape hatch.
  3. Handling stuck reconnects — when the SDK has been retrying for 30+ seconds and nothing's working, what do you tell the user?
  4. Handling token-provider-error — if their auth has expired, prompt them to log in again.
  5. Handling terminal close codes — show the right UI for each (suspended, kicked, unauthorized).

The five patterns

1. Reconnecting banner

Most apps want a banner that appears when reconnect starts and disappears when it succeeds. The state-change event drives this.

let reconnectingBanner = null;

peer.on("state-change", ({ to }) => {
  if (to === "reconnecting") {
    reconnectingBanner = showBanner("Reconnecting…");
  } else if (reconnectingBanner) {
    reconnectingBanner.hide();
    reconnectingBanner = null;
  }
});

Tip — don't show the banner for the first attempt. If reconnect succeeds within 500 ms (the initial backoff), the banner flickers in and out so fast it's jarring. Delay showing by ~1 s:

let bannerTimer = null;

peer.on("state-change", ({ to }) => {
  if (to === "reconnecting") {
    bannerTimer = setTimeout(() => showBanner("Reconnecting…"), 1000);
  } else {
    if (bannerTimer) clearTimeout(bannerTimer);
    hideBanner();
  }
});

2. Reopen DataChannels after reconcile

If — and only if — you use the remote.pc.createDataChannel(...) escape hatch for P2P data, the channel handle is tied to the old RTCPeerConnection and won't survive a reconcile.

The pattern: don't open the DataChannel imperatively. Wire it to state-change → "connected", which fires once on initial connect AND once per reconcile cycle:

const channels = new Map(); // peerId → currently-open DataChannel

peer.on("peer-joined", ({ peer: remote }) => {
  remote.on("state-change", ({ to }) => {
    if (to !== "connected") return;

    // Drop the previous DC (closed by the PC swap, but we clean up our map).
    channels.get(remote.id)?.close();

    const dc = remote.pc.createDataChannel("game-state", { ordered: false });
    dc.onmessage = (ev) => handleGameTick(JSON.parse(ev.data));
    dc.onopen = () => console.log("DC open to", remote.id);
    channels.set(remote.id, dc);
  });
});

peer.on("peer-left", ({ peer: remote }) => {
  channels.get(remote.id)?.close();
  channels.delete(remote.id);
});

If you're not using DataChannels — if all your inter-peer data goes through peer.send(data) — you don't need this. peer.send is server-routed and survives reconnect automatically.

See Data Channels & Low Latency for the full P2P pattern.

3. Stuck in "reconnecting"

The SDK retries for maxAttempts (default 100) before giving up. Once the backoff reaches its 30 s cap, exhausting all 100 attempts takes roughly 45 minutes of wall-clock time. If something is genuinely broken (server outage, customer's WAF blocking WebSockets, etc.), the user should see a "retry now?" button rather than staring at "reconnecting…" forever.

let stuckTimer = null;

peer.on("state-change", ({ to }) => {
  if (to === "reconnecting") {
    stuckTimer = setTimeout(() => {
      if (peer.state === "reconnecting") {
        showStuckBanner({
          message: "Still trying to reconnect…",
          retryButton: async () => {
            await peer.close();
            // `peer` must be declared with `let` for this reassignment to work.
            // The new instance starts with no listeners, so re-attach yours.
            peer = new MeteredPeer(opts);
            await peer.join(channel);
          },
        });
      }
    }, 30_000); // 30 s is a reasonable default; tune to your app
  } else {
    if (stuckTimer) clearTimeout(stuckTimer);
    hideStuckBanner();
  }
});

Why close() + new instance? peer.close() is terminal — the existing instance can't rejoin. Constructing a fresh MeteredPeer resets all internal state (backoff counter, cached welcome, subscribe set) and tries from scratch.
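
One thing the snippet above leaves implicit is that a fresh instance carries none of the listeners registered on the old one. A possible way to keep construction and wiring together is sketched below. `createRetrier`, `makePeer` and `wire` are hypothetical names, not SDK API; the sketch only assumes that close() and join(channel) return promises, as the awaits in the snippet above suggest.

```javascript
// Sketch: rebuild the peer and re-attach handlers in one place.
// `makePeer` and `wire` are callbacks supplied by your app, for example
// makePeer = () => new MeteredPeer(opts) and wire = (p) => p.on("state-change", ...).
function createRetrier({ makePeer, wire, channel }) {
  let current = null;  // the live peer instance
  let inFlight = null; // shared promise, so a double click retries only once

  async function start() {
    const next = makePeer();
    wire(next); // attach listeners before joining so no event is missed
    current = next;
    await next.join(channel);
    return next;
  }

  async function retryNow() {
    if (inFlight) return inFlight;
    inFlight = (async () => {
      try {
        if (current) await current.close(); // terminal for the old instance
        return await start();
      } finally {
        inFlight = null;
      }
    })();
    return inFlight;
  }

  return {
    start,
    retryNow,
    get peer() {
      return current;
    },
  };
}
```

With this shape, the "retry now?" button calls retryNow(), and the rest of the app reads the live instance through the `peer` getter instead of holding a reference that goes stale.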

4. tokenProvider failures

If your tokenProvider() keeps throwing — your backend is down, the user's session expired, a CDN cache is serving an old API — the SDK fires token-provider-error after 3 consecutive failures. This is informational; the SDK keeps retrying. Surface a "please log in again" prompt if appropriate.

client.on("token-provider-error", ({ consecutiveFailures, err }) => {
  console.warn(`tokenProvider failed ${consecutiveFailures}x:`, err);
  if (consecutiveFailures >= 5) {
    showAuthErrorBanner("Your session may have expired. Log in again to continue.");
  }
});

Don't disconnect the client yourself — that throws away the existing connection (which may still be working for now, with the old token, before its expiry). Let the SDK keep trying. The banner is purely a hint to the user.

5. Terminal close codes

The SDK won't retry these. MeteredPeer customers can wire a single error handler — the SDK forwards terminal close codes (4001/4002/4003/4012/4020), fatal server-error frames, and tokenProvider failures past the retry threshold all through the same event, with err.name carrying a symbolic code:

peer.on("error", ({ err }) => {
  switch (err.name) {
    case "invalid_token":
    case "token_expired":
      showLoginPrompt("Your session is invalid. Log in again.");
      break;
    case "channel_not_authorized":
      showError("You don't have access to that channel.");
      break;
    case "account_suspended":
      showAccountSuspendedUI();
      break;
    case "admin_disconnect":
      showKickedUI("You were disconnected by an administrator.");
      break;
    case "TokenProviderError":
      showAuthFlowBroken("Your auth pipeline keeps rejecting; check your token endpoint.");
      break;
    default:
      reportToSentry(err);
  }
});

SignallingClient customers wire on the lower-level disconnected event instead — there's no error consolidation at that layer:

import { WsCloseCode } from "@metered-ca/peer";

client.on("disconnected", ({ code, willReconnect }) => {
  if (willReconnect) return; // SDK is handling it

  switch (code) {
    case WsCloseCode.InvalidToken:
      showLoginPrompt("Your session is invalid. Log in again.");
      break;
    case WsCloseCode.ChannelNotAuthorized:
      showError("You don't have access to that channel.");
      break;
    case WsCloseCode.AccountSuspended:
      showAccountSuspendedUI();
      break;
    case WsCloseCode.AdminDisconnect:
      showKickedUI("You were disconnected by an administrator.");
      break;
    default:
      // maxAttempts exhausted
      showGenericError("Could not reconnect. Refresh the page to try again.");
  }
});

Terminal close codes

| Code | Constant | Why it happened | What to do |
| --- | --- | --- | --- |
| 4001 | InvalidToken | JWT signature / kid / format wrong | Re-mint with the correct key |
| 4003 | ChannelNotAuthorized | Channel not in JWT's channels claim | Mint a new JWT that includes the channel |
| 4012 | AccountSuspended | Customer-level kill switch (unpaid, manual) | Direct user to billing |
| 4020 | AdminDisconnect | DELETE /v1/peers/:id from REST API | Tell the user they were kicked |

Per-scenario playbook

What actually happens, step by step, in the most common failure modes:

Wi-Fi drops, reconnects within seconds

  1. The OS-level TCP connection times out. The WS fires close with code 1006.
  2. SDK fires disconnected({ code: 1006, willReconnect: true }).
  3. SDK fires state-change → "reconnecting".
  4. After ~500 ms backoff, SDK opens a new WS.
  5. The new WS receives welcome. SDK fires connected({ isReconnect: true }).
  6. Auto-resubscribe issues a subscribe for each previously-subscribed channel.
  7. (For MeteredPeer) Channel reconcile: presence diff identifies survivors / leavers / newcomers. Survivors' PCs are silently swapped.
  8. SDK fires state-change → "connected" (or "joined" for MeteredPeer).

Your code sees: state-change → "reconnecting" then state-change → "connected". Optional banner shown / hidden if you wired one up.

Wi-Fi → cellular roam (mobile)

Same as Wi-Fi drop, but with one extra wrinkle for MeteredPeer: the per-peer ICE state often goes "disconnected" before the WS itself notices. The per-peer ICE-restart ladder kicks in:

  1. RTCPeerConnection.iceConnectionState === "disconnected" for peer X.
  2. SDK fires remote.state-change → "reconnecting" for peer X.
  3. SDK calls pc.restartIce() with exponential backoff, up to 9 attempts.
  4. Either ICE recovers (peer X back to "connected") or — if the WS itself drops — Layer 3 reconcile kicks in instead.

If the WS stays up, only that peer's state flickers; other peers are untouched.
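
Because only the affected peer's state changes, a UI can mark just that peer's tile instead of showing a global banner. The sketch below is one way to track this; `trackPeerStates` and `onChange` are hypothetical names, and the only SDK surface it relies on is the peer-joined / peer-left events and the per-peer state-change event already used in pattern 2.

```javascript
// Sketch: remember the latest state of every remote peer so the UI can
// flag only the ones that are mid ICE restart.
function trackPeerStates(peer, onChange) {
  const states = new Map(); // remote id → latest state string

  peer.on("peer-joined", ({ peer: remote }) => {
    remote.on("state-change", ({ to }) => {
      states.set(remote.id, to);
      onChange(remote.id, to);
    });
  });

  peer.on("peer-left", ({ peer: remote }) => {
    states.delete(remote.id);
    onChange(remote.id, "left"); // "left" is this sketch's own marker, not an SDK state
  });

  return {
    // Ids of remote peers currently in "reconnecting".
    reconnectingIds: () =>
      [...states].filter(([, state]) => state === "reconnecting").map(([id]) => id),
  };
}
```

A caller would pass an `onChange` that dims or restores the tile for the given id, and could use reconnectingIds() to render a summary such as "2 participants reconnecting".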

Tab backgrounded (Chrome / mobile Safari throttling)

Backgrounded tabs get JS timers throttled aggressively. The SDK's inactivity watchdog (default 60 s) catches stuck connections:

  1. Tab is backgrounded; the server keeps sending pings, but JS doesn't run to respond.
  2. SDK's setTimeout-based watchdog eventually fires (slowly), or the server times out and closes (code 1001 or 1006).
  3. When the user foregrounds the tab, the WS is dead. SDK opens a new one.
  4. Normal reconnect flow.

For most apps, this is fine: the reconnect happens within ~1 s of foregrounding. If you need faster recovery, listen for the document's visibilitychange event and force a reconnect proactively:

document.addEventListener("visibilitychange", () => {
  if (!document.hidden && peer.state === "reconnecting") {
    // Sometimes the throttled WS isn't actually dead yet, but we want
    // to move things along. Construct a fresh instance.
    // retryNow() is your own helper, e.g. the close() + new MeteredPeer
    // routine from pattern 3; it is not an SDK function.
    void retryNow();
  }
});

(Most apps don't need this. The default behaviour works.)

Server deploy (rolling restart)

  1. Server emits a going_away frame carrying a retryAfterMs hint.
  2. SDK fires going-away event with { retryAfterMs } — informational.
  3. Server closes with code 1001.
  4. SDK fires disconnected({ code: 1001, willReconnect: true }).
  5. SDK reconnects on its normal exponential backoff.
  6. A new server instance accepts the connection. Welcome arrives.

Note: The current SDK does NOT automatically delay by retryAfterMs — it uses its normal backoff curve. The retryAfterMs hint exists so customers can opt in to spreading reconnects across a fleet during a deploy. If you want to honor it, listen for going-away yourself and stall your own retry:

let manualDelayMs = 0;
peer.on("going-away", ({ retryAfterMs }) => {
  manualDelayMs = retryAfterMs;
});
peer.on("disconnected", async ({ willReconnect, code }) => {
  if (willReconnect && manualDelayMs > 0) {
    // Disable auto-reconnect to take control, sleep, then reconnect.
    // (Most apps don't bother — normal backoff is fine for browser clients.)
  }
});

For browser-side apps this rarely matters — the normal backoff jitter already spreads reconnects across the fleet adequately. Long-running server-side processes that must coordinate with a deploy window may want to honor the hint.

JWT expires mid-session

  1. Server sends close code 4002 (TokenExpired).
  2. SDK fires disconnected({ code: 4002, willReconnect: true }).
  3. SDK calls your tokenProvider() to get a fresh JWT.
  4. Reconnect.

For this to work, tokenProvider() must never hand back a JWT that has already expired, so don't cache a token all the way to its expiry. Either mint a new one on every call, or cache with a TTL shorter than the JWT's exp (e.g. mint with a 1 h expiry, cache for 50 minutes).
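
The "cache with a tighter TTL" option can be sketched as a small wrapper. `createTokenProvider` and `mintToken` are hypothetical names: `mintToken` stands for whatever async call your app makes to its own backend to obtain a JWT string. The 50-minute default mirrors the example above and assumes a 1 h token lifetime.

```javascript
// Sketch: cache a minted JWT for less time than it is valid.
// `mintToken` is your own async function resolving to a JWT string;
// `now` is injectable so the cache can be tested without waiting.
function createTokenProvider(mintToken, { cacheMs = 50 * 60 * 1000, now = Date.now } = {}) {
  let cached = null;  // last successfully minted token
  let cachedAt = 0;   // when it was minted (ms)
  let pending = null; // in-flight mint, shared by concurrent callers

  return async function tokenProvider() {
    if (cached !== null && now() - cachedAt < cacheMs) return cached;
    if (!pending) {
      pending = mintToken()
        .then((token) => {
          cached = token;
          cachedAt = now();
          return token;
        })
        .finally(() => {
          pending = null; // failures are not cached; the next call mints again
        });
    }
    return pending;
  };
}
```

The returned function can be supplied wherever you currently pass your tokenProvider. Sharing the in-flight promise keeps a burst of reconnect attempts from hitting the mint endpoint several times at once.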

Concurrent-connection cap hit

  1. SDK tries to reconnect; server rejects with 4010.
  2. SDK enforces a ≥30 s backoff floor.
  3. Either the cap clears (another connection freed up) or the SDK gives up after maxAttempts.

This usually means a connection leak somewhere — old MeteredPeer instances you forgot to close(). Check your state === "closed" cleanup. The dashboard's "Current concurrent connections" graph shows you the number.

Testing your reconnect logic

You can't validate reconnect behaviour by just running the happy path. The following list covers the failure modes worth testing before going to production.

| Test | How to simulate | What to verify |
| --- | --- | --- |
| Brief network drop | DevTools → Network → "Offline" for 5 s | Banner appears, then hides; messages buffered locally arrive after reconnect |
| Long network drop | "Offline" for 60 s | Stuck-reconnect banner appears at ~30 s; recovers cleanly when network restored |
| Server-initiated close | DELETE /v1/peers/:id from another tab using the REST API | Terminal close code 4020 surfaces; no retry; "kicked" UI shown |
| JWT expiry | Mint a JWT with exp 60 s out and wait | Code 4002 disconnect, tokenProvider() called, reconnect succeeds with fresh token |
| Token revocation | Revoke the key from the dashboard while connected | Disconnects + reconnect attempt fails with 4001; "log in again" UI shown |
| DataChannel reopen | If using P2P DCs: force a reconnect, verify DC traffic resumes | Your state-change → "connected" handler fires, opens a fresh DC, data resumes |
| Concurrent cap | Open more connections than your plan allows | Code 4010, slow backoff, eventually gives up or one frees |
| Tab backgrounded | Switch tabs for 5 minutes | On foreground, reconnects within seconds; messages from while-backgrounded delivered |
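
The rows above need a real browser and network, but the handler logic itself can also be unit-tested with fakes. The sketch below restates the delayed banner from pattern 1 with injectable timers, plus two minimal test doubles. All names here (`createBannerController`, `createFakePeer`, `createManualTimers`, the `ui` object) are hypothetical and not part of the SDK.

```javascript
// Sketch: delayed "Reconnecting…" banner with injectable timers.
function createBannerController(peer, ui, opts = {}) {
  const { delayMs = 1000, setTimer = setTimeout, clearTimer = clearTimeout } = opts;
  let timer = null;

  peer.on("state-change", ({ to }) => {
    // Cancel any pending show before handling the new state.
    if (timer !== null) {
      clearTimer(timer);
      timer = null;
    }
    if (to === "reconnecting") {
      timer = setTimer(() => {
        timer = null;
        ui.show("Reconnecting…");
      }, delayMs);
    } else {
      ui.hide();
    }
  });
}

// Test double: records handlers and lets the test emit events by hand.
function createFakePeer() {
  const handlers = {};
  return {
    on(event, fn) {
      (handlers[event] = handlers[event] || []).push(fn);
    },
    emit(event, payload) {
      (handlers[event] || []).forEach((fn) => fn(payload));
    },
  };
}

// Test double: timers that fire only when the test calls runAll().
function createManualTimers() {
  let nextId = 1;
  const pending = new Map();
  return {
    setTimer(fn) {
      const id = nextId++;
      pending.set(id, fn);
      return id;
    },
    clearTimer(id) {
      pending.delete(id);
    },
    runAll() {
      for (const [id, fn] of [...pending]) {
        pending.delete(id);
        fn();
      }
    },
  };
}
```

A test then emits state-change events on the fake peer, calls runAll() to simulate the grace period elapsing, and asserts on the calls the `ui` object received. A reconnect that completes before runAll() should never show the banner.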

Common pitfalls

  1. Reconnect banner that flickers. Caused by not adding a 1 s grace period before showing it (see pattern 1).

  2. DataChannel silently stops carrying data after a reconnect. Caused by holding RTCDataChannel references across the drop. Use the state-change → "connected" pattern from pattern 2.

  3. Stale remote.pc references. Same root cause as #2 — remote.pc is a different object after a reconcile. Either re-read it every time you need it, or wire to state-change.

  4. tokenProvider() returning a stale (cached) JWT. If your mint endpoint caches and you refresh by reconnecting, you'll cycle: 4002 close → tokenProvider → stale token → 4002 again. Make sure mint returns fresh.

  5. Adding state listeners on the underlying pc directly. Listeners added via remote.pc.addEventListener(...) go away when the PC is swapped on reconcile. Listen on the RemotePeer wrapper (remote.on(...)) instead — those persist.

  6. Calling peer.send(...) during "reconnecting". Rejects with MeteredPeerSendError("not_joined") or similar. Either gate sends on state === "joined", or queue them yourself and flush on state-change → "joined".

  7. Trying to rejoin with the same MeteredPeer after close(). Terminal. Construct a fresh instance.

  8. Not handling state === "closed". This is the SDK telling you "I'm giving up." If your code just waits forever for the next state-change, you're stuck. Show the user a retry button.

  9. autoResubscribe: false + forgetting to resubscribe. Default is true — leave it on unless you have a specific reason. Setting it false and not handling the reconnect-resubscribe yourself means you silently miss messages on every subsequent connection.

  10. Reconnect on every page load. This is fine in browsers (transient connection per tab is normal), but if you're running a long-lived Node process, set reconnect.maxAttempts: Infinity so transient cloud-provider blips don't take you offline permanently.
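
For pitfall 6, the "queue them yourself" option might look like the sketch below. `createSendQueue`, `maxQueued` and `onError` are hypothetical; the only SDK surface used is peer.state, peer.send and the state-change event. It assumes peer.state already reads "joined" by the time the state-change handler runs.

```javascript
// Sketch: buffer outgoing messages while the peer is not joined and
// flush them in order once it rejoins.
function createSendQueue(peer, { maxQueued = 100, onError = console.warn } = {}) {
  const queue = [];

  // Send immediately; if send() returns a promise, route a rejection to
  // onError rather than leaving it unhandled.
  function sendNow(data) {
    const result = peer.send(data);
    if (result && typeof result.catch === "function") result.catch(onError);
    return result;
  }

  function flush() {
    while (queue.length > 0 && peer.state === "joined") {
      sendNow(queue.shift());
    }
  }

  peer.on("state-change", ({ to }) => {
    if (to === "joined") flush();
  });

  return {
    send(data) {
      // Only bypass the queue when nothing is waiting, to preserve order.
      if (peer.state === "joined" && queue.length === 0) return sendNow(data);
      if (queue.length >= maxQueued) queue.shift(); // full: drop the oldest message
      queue.push(data);
      return undefined;
    },
    get pending() {
      return queue.length;
    },
  };
}
```

Dropping the oldest message when the queue is full suits state updates where only recent values matter. For chat-style traffic you may prefer to reject new messages instead, so nothing already accepted is lost.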

See also