Reconnect Best Practices

Required reading before production. Real users put their laptops to sleep, ride elevators, switch from Wi-Fi to cellular, lose tabs to memory pressure, and run into deploys. Apps that haven't thought about reconnect look broken; apps that have look like magic.

This guide covers: what failure modes happen in production, what the SDK handles automatically, what your code must own, the standard UI patterns, the gotchas that bite every developer once, and how to test the whole thing.

TL;DR — the five things you need to do

Listen for state-change and surface a "reconnecting…" banner when state === "reconnecting". (pattern below)
If you use the remote.pc.createDataChannel(...) escape hatch, re-open the DC on each state-change → "connected" event. (pattern below)
Set a stuck-reconnect timeout — after 30 s of "reconnecting", surface a "still trying… retry now?" button. (pattern below)
Listen for token-provider-error and surface a "please log in again" prompt when your tokenProvider() keeps failing. (pattern below)
Handle terminal close codes explicitly — 4001 / 4003 / 4012 / 4020 — show the user what action they need to take. (table below)

The rest of this page expands each of these.

What can go wrong in production

Scenario	Frequency	Detected as
Wi-Fi drops, immediately reconnects	Common	Brief `reconnecting` → `connected`, ~1–2 s
Wi-Fi → cellular roam	Common on mobile	`reconnecting` → `connected`, ~3–8 s
Laptop sleep + wake	Common	`reconnecting` → `connected` on wake
Server deploy / rolling restart	Daily-to-weekly	`disconnected` with code 1001, then auto-reconnect
TURN failover (one TURN node dies)	Rare	Per-peer `reconnecting` (ICE-restart ladder)
Tab backgrounded (Chrome throttles)	Common on mobile	`reconnecting` after the throttle window expires
Network captive portal / coffee shop login	Common	Long `reconnecting` until user dismisses portal
JWT expires mid-session	Predictable from `expiresAt`	4002 close + auto-refresh via `tokenProvider`
Plan concurrent-cap hit (after reconnect retry)	Rare	4010 close, ≥30 s backoff
User suspended for unpaid balance	Action-required	4012 close, no retry
Admin kicked the peer	Action-required	4020 close, no retry
Token leaked + revoked	Action-required	4001 close on next reconnect

What the SDK handles for you

The SDK has three independent reconnect layers that cooperate. Your code rarely needs to know about the layers individually: they all surface as the same state === "reconnecting", and almost every recovery happens without your intervention. They are described here for completeness:

Layer 1 — Signalling WebSocket auto-reconnect

The most common case. The WS drops; the SDK opens a new one with exponential backoff + jitter.

Backoff: starts at 500 ms, doubles each attempt, caps at 30 s, ±20% jitter.
Token refresh: your tokenProvider() is re-called on every reconnect, so a refreshed JWT (with new TURN creds, new permissions, new expiry) lands automatically.
Auto-resubscribe: every channel that was subscribed before the drop is re-subscribed after the new welcome (assuming autoResubscribe: true, which is the default).
Close-code-aware: terminal codes skip retry; rate-limit codes slow-backoff. See the terminal codes table below.
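
The backoff schedule above can be sketched as a pure function. This is an illustration of the documented numbers, not the SDK's internal code: the name backoffDelayMs, the injectable rand parameter, and the choice to apply jitter after the cap are all assumptions.

```javascript
// Illustrative only: reproduces the documented schedule (500 ms start,
// doubling, 30 s cap, ±20% jitter). Not the SDK's implementation.
const BASE_MS = 500;
const CAP_MS = 30_000;
const JITTER = 0.2;

// attempt is 0-based. rand returns a number in [0, 1) and is injectable
// so the schedule can be checked deterministically.
function backoffDelayMs(attempt, rand = Math.random) {
  const capped = Math.min(BASE_MS * 2 ** attempt, CAP_MS);
  // Map rand() from [0, 1) onto a factor in [0.8, 1.2).
  const factor = 1 + (rand() * 2 - 1) * JITTER;
  return Math.round(capped * factor);
}
```

With the jitter pinned to its midpoint (rand returning 0.5), the first delays are 500, 1000, 2000 and 4000 ms, and every attempt from the seventh onward waits 30 000 ms.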

Layer 2 — ICE-restart ladder (per peer)

If a single peer's RTCPeerConnection goes to "disconnected" or "failed" (Wi-Fi→cellular roam, TURN node failed), the SDK runs an ICE-restart ladder: 9 attempts over ~121 s total.

Surfaces as that peer's state === "reconnecting" while it runs.
Other peers in the channel are unaffected.
Happens entirely inside the SDK — your code just waits. Attempts whose recovery offer goes unanswered (the other side died mid-recovery) are timed out and retried automatically (SDK ≥ 1.1).
If the whole budget is spent, that peer's negotiation-error fires once with err.name === "IceRestartExhaustedError" — the one terminal signal on that event; every other negotiation-error is transient recovery noise you can ignore. Recovery from the terminal case: the next signalling-level reconnect replaces the connection automatically, or close() + re-join.
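
A minimal sketch of filtering negotiation-error down to its one terminal case. The helper name watchIceExhaustion and its callback are hypothetical, and the { err } payload shape is assumed by analogy with the error event shown later on this page.

```javascript
// Hypothetical helper: reacts only to the terminal negotiation-error.
// Assumes the event payload is shaped like { err }.
function watchIceExhaustion(remote, onExhausted) {
  remote.on("negotiation-error", ({ err }) => {
    if (err.name !== "IceRestartExhaustedError") {
      return; // transient recovery noise: the ladder is still running
    }
    onExhausted(remote, err); // terminal: the whole restart budget is spent
  });
}
```

You would typically call this from a peer-joined handler and use onExhausted to mark that peer's tile as failed while waiting for the next signalling-level reconnect.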

Layer 3 — Channel-level reconcile

When the signalling WS drops, every peer in your channel is technically orphaned (their PCs are still up, but the SDK can't route SDP between you). On reconnect, the SDK performs a reconcile:

The RemotePeer instances in peer.remotePeers are preserved across the drop. Same object identity (=== returns true) and same id.
Each survivor's underlying RTCPeerConnection is silently replaced with a fresh one using the latest TURN credentials. Old PC is closed.
Peers that genuinely left during the drop fire peer-left. Peers that joined fire peer-joined.
Local streams added via peer.addStream(...) automatically reattach to each survivor's new PC — no renegotiation cycles in your code.
The reconcile is self-guarding (SDK ≥ 1.1): if the post-reconnect roster doesn't arrive, the SDK re-requests it; if that still fails, error fires with err.name === "ReconcileTimeoutError" (~30 s after the reconnect) instead of sitting in "reconnecting" forever. Wire it to the same recovery as the stuck-reconnect banner below.

The reconcile is the most surprising layer. The key consequence is: your RemotePeer references survive, but raw RTCPeerConnection / RTCDataChannel handles you held don't. Pattern is below.
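
The consequence can be illustrated with a short sketch: key your own state on the RemotePeer object (which survives), and read remote.pc at the moment you need it rather than caching it. The helper names here are hypothetical; signalingState is a standard RTCPeerConnection property.

```javascript
// App-level state keyed by the RemotePeer object itself. The same object
// is kept across a reconcile, so the entry stays valid.
const tiles = new Map(); // RemotePeer -> { label }

function rememberPeer(remote, label) {
  tiles.set(remote, { label });
}

function forgetPeer(remote) {
  tiles.delete(remote); // call this from your peer-left handler
}

// Read remote.pc at call time instead of caching it: after a reconcile
// the property points at a different RTCPeerConnection.
function currentSignalingState(remote) {
  return remote.pc.signalingState;
}
```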

What your code must own

The SDK can't decide what your UI looks like. You own:

The "reconnecting…" UI — when to show it, when to hide it, what it says.
Re-opening any P2P DataChannels you opened via the remote.pc.createDataChannel(...) escape hatch.
Handling stuck reconnects — when the SDK has been retrying for 30+ seconds and nothing's working, what do you tell the user?
Handling token-provider-error — if their auth has expired, prompt them to log in again.
Handling terminal close codes — show the right UI for each (suspended, kicked, unauthorized).

The five patterns

1. Reconnecting banner

Most apps want a banner that appears when reconnect starts and disappears when it succeeds. The state-change event drives this.

let reconnectingBanner = null;

peer.on("state-change", ({ to }) => {
  if (to === "reconnecting") {
    reconnectingBanner = showBanner("Reconnecting…");
  } else if (reconnectingBanner) {
    reconnectingBanner.hide();
    reconnectingBanner = null;
  }
});

Tip — don't show the banner for the first attempt. If reconnect succeeds within 500 ms (the initial backoff), the banner flickers in and out so fast it's jarring. Delay showing by ~1 s:

let bannerTimer = null;

peer.on("state-change", ({ to }) => {
  // Cancel any pending show first so repeated events never stack timers.
  if (bannerTimer) clearTimeout(bannerTimer);
  bannerTimer = null;

  if (to === "reconnecting") {
    bannerTimer = setTimeout(() => showBanner("Reconnecting…"), 1000);
  } else {
    hideBanner();
  }
});

2. Reopen DataChannels after reconcile

If — and only if — you use the remote.pc.createDataChannel(...) escape hatch for P2P data, the channel handle is tied to the old RTCPeerConnection and won't survive a reconcile.

The pattern: don't open the DataChannel imperatively. Wire it to state-change → "connected", which fires once on initial connect AND once per reconcile cycle:

const channels = new Map(); // peerId → currently-open DataChannel

peer.on("peer-joined", ({ peer: remote }) => {
  remote.on("state-change", ({ to }) => {
    if (to !== "connected") return;

    // Drop the previous DC (closed by the PC swap, but we clean up our map).
    channels.get(remote.id)?.close();

    const dc = remote.pc.createDataChannel("game-state", { ordered: false });
    dc.onmessage = (ev) => handleGameTick(JSON.parse(ev.data));
    dc.onopen = () => console.log("DC open to", remote.id);
    channels.set(remote.id, dc);
  });
});

peer.on("peer-left", ({ peer: remote }) => {
  channels.get(remote.id)?.close();
  channels.delete(remote.id);
});

If you're not using DataChannels — if all your inter-peer data goes through peer.send(data) — you don't need this. peer.send is server-routed and survives reconnect automatically.

See Data Channels & Low Latency for the full P2P pattern.

3. Stuck in `"reconnecting"`

The SDK retries for maxAttempts (default 100) before giving up. With 30-second backoffs at the cap, that's a lot of wall-clock time. If something is genuinely broken (server outage, customer's WAF blocking WebSockets, etc.), the user should see a "retry now?" button rather than staring at "reconnecting…" forever.

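let stuckTimer = null;

// `peer` must be declared with `let`, because the retry handler replaces it.
peer.on("state-change", ({ to }) => {
  // Clear any pending timer first so repeated events never stack timers.
  if (stuckTimer) clearTimeout(stuckTimer);
  stuckTimer = null;

  if (to === "reconnecting") {
    stuckTimer = setTimeout(() => {
      if (peer.state === "reconnecting") {
        showStuckBanner({
          message: "Still trying to reconnect…",
          retryButton: async () => {
            await peer.close();
            peer = new MeteredPeer(opts);
            // The new instance has no listeners yet: re-attach this handler
            // (and your other peer.on(...) wiring) before joining.
            await peer.join(channel);
          },
        });
      }
    }, 30_000); // 30 s is a reasonable default; tune to your app
  } else {
    hideStuckBanner();
  }
});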

Why close() + new instance? peer.close() is terminal — the existing instance can't rejoin. Constructing a fresh MeteredPeer resets all internal state (backoff counter, cached welcome, subscribe set) and tries from scratch.

SDK ≥ 1.1 gives you a programmatic trigger too: if the connection comes back but the channel roster can't be restored, the SDK fires error with err.name === "ReconcileTimeoutError" (~30 s after the reconnect) instead of staying silent. Wire it to the same banner / retry handler:

peer.on("error", ({ err }) => {
  if (err.name === "ReconcileTimeoutError") showStuckBanner({ /* same as above */ });
});

Keep the timer-based banner as well — it also covers the case where the connection itself never comes back (the timeout error only fires once a reconnect has succeeded at the connection level).
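
To put a rough number on the retry budget that motivates the 30 s timer, the sketch below adds up the backoff delays for the default maxAttempts of 100. It assumes one backoff delay per attempt and ignores jitter and the time each connection attempt itself takes, so treat the result as an order of magnitude rather than a guarantee. The function name is hypothetical.

```javascript
// Sums the documented schedule: 500 ms start, doubling, 30 s cap.
// Assumes one delay per attempt; ignores jitter and connect time.
function totalBackoffMs(maxAttempts, baseMs = 500, capMs = 30_000) {
  let total = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    total += Math.min(baseMs * 2 ** attempt, capMs);
  }
  return total;
}

const minutes = totalBackoffMs(100) / 60_000;
console.log(minutes.toFixed(1)); // 47.5
```

Under those assumptions the SDK would keep trying for roughly three quarters of an hour, far longer than a user will watch a banner, which is why the manual retry prompt matters.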

4. `tokenProvider` failures

If your tokenProvider() keeps throwing — your backend is down, the user's session expired, a CDN cache is serving an old API — the SDK fires token-provider-error after 3 consecutive failures. This is informational; the SDK keeps retrying. Surface a "please log in again" prompt if appropriate.

client.on("token-provider-error", ({ consecutiveFailures, err }) => {
  console.warn(`tokenProvider failed ${consecutiveFailures}x:`, err);
  if (consecutiveFailures >= 5) {
    showAuthErrorBanner("Your session may have expired. Log in again to continue.");
  }
});

Don't disconnect the client yourself — that throws away the existing connection (which may still be working for now, with the old token, before its expiry). Let the SDK keep trying. The banner is purely a hint to the user.

5. Terminal close codes

The SDK won't retry these. MeteredPeer customers can wire a single error handler — the SDK forwards terminal close codes (4001/4002/4003/4012/4020), fatal server-error frames, and tokenProvider failures past the retry threshold all through the same event, with err.name carrying a symbolic code:

peer.on("error", ({ err }) => {
  switch (err.name) {
    case "invalid_token":
    case "token_expired":
      showLoginPrompt("Your session is invalid. Log in again.");
      break;
    case "channel_not_authorized":
      showError("You don't have access to that channel.");
      break;
    case "account_suspended":
      showAccountSuspendedUI();
      break;
    case "admin_disconnect":
      showKickedUI("You were disconnected by an administrator.");
      break;
    case "TokenProviderError":
      showAuthFlowBroken("Your auth pipeline keeps rejecting; check your token endpoint.");
      break;
    default:
      reportToSentry(err);
  }
});

SignallingClient customers listen on the lower-level disconnected event instead, because there is no error consolidation at that layer:

import { WsCloseCode } from "@metered-ca/realtime";

client.on("disconnected", ({ code, willReconnect }) => {
  if (willReconnect) return; // SDK is handling it

  switch (code) {
    case WsCloseCode.InvalidToken:
      showLoginPrompt("Your session is invalid. Log in again.");
      break;
    case WsCloseCode.ChannelNotAuthorized:
      showError("You don't have access to that channel.");
      break;
    case WsCloseCode.AccountSuspended:
      showAccountSuspendedUI();
      break;
    case WsCloseCode.AdminDisconnect:
      showKickedUI("You were disconnected by an administrator.");
      break;
    default:
      // maxAttempts exhausted
      showGenericError("Could not reconnect. Refresh the page to try again.");
  }
});

Terminal close codes

Code	Constant	Why it happened	What to do
4001	`InvalidToken`	JWT signature / kid / format wrong	Re-mint with the correct key
4003	`ChannelNotAuthorized`	Channel not in JWT's `channels` claim	Mint a new JWT that includes the channel
4012	`AccountSuspended`	Customer-level kill switch (unpaid, manual)	Direct user to billing
4020	`AdminDisconnect`	`DELETE /v1/peers/:id` from REST API	Tell the user they were kicked
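
If you prefer data over a switch statement, the table can be mirrored as a lookup. This is a sketch built only from the codes and constants named on this page; the CLOSE_CODES shape and the describeCloseCode name are assumptions, not SDK exports. Code 4002 is included for contrast because the SDK retries it, and 4010 is left out because its constant name is not given here.

```javascript
// Illustrative lookup mirroring the tables on this page.
const CLOSE_CODES = new Map([
  [4001, { name: "InvalidToken", retries: false, action: "Re-mint with the correct key" }],
  [4002, { name: "TokenExpired", retries: true, action: "None: refreshed via tokenProvider" }],
  [4003, { name: "ChannelNotAuthorized", retries: false, action: "Mint a JWT that includes the channel" }],
  [4012, { name: "AccountSuspended", retries: false, action: "Direct the user to billing" }],
  [4020, { name: "AdminDisconnect", retries: false, action: "Tell the user they were kicked" }],
]);

// Returns the table row for a close code, or null for codes not listed.
function describeCloseCode(code) {
  return CLOSE_CODES.get(code) || null;
}

// True only for listed codes that the SDK will not retry.
function isTerminalCloseCode(code) {
  const entry = describeCloseCode(code);
  return entry !== null && !entry.retries;
}
```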

Per-scenario playbook

What actually happens, step by step, in the most common failure modes:

Wi-Fi drops, reconnects within seconds

The OS-level TCP connection times out. The WS fires close with code 1006.
SDK fires disconnected({ code: 1006, willReconnect: true }).
SDK fires state-change → "reconnecting".
After ~500 ms backoff, SDK opens a new WS.
The new WS receives welcome. SDK fires connected({ isReconnect: true }).
Auto-resubscribe issues a subscribe for each previously-subscribed channel.
(For MeteredPeer) Channel reconcile: presence diff identifies survivors / leavers / newcomers. Survivors' PCs are silently swapped.
SDK fires state-change → "connected" (or "joined" for MeteredPeer).

Your code sees: state-change → "reconnecting" then state-change → "connected". Optional banner shown / hidden if you wired one up.
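
Because your code only sees the two state-change events, they are also a convenient place to measure how long each outage lasted. The helper below is a hypothetical sketch: trackReconnectDurations, report, and the injectable now clock are not part of the SDK, and it accepts either "connected" or "joined" as the recovered state.

```javascript
// Hypothetical telemetry helper: reports the length of each
// reconnecting -> connected/joined cycle in milliseconds.
function trackReconnectDurations(peer, report, now = Date.now) {
  let startedAt = null;
  peer.on("state-change", ({ to }) => {
    if (to === "reconnecting") {
      // Keep the first timestamp if the event repeats mid-outage.
      if (startedAt === null) startedAt = now();
      return;
    }
    if (startedAt !== null && (to === "connected" || to === "joined")) {
      report(now() - startedAt);
      startedAt = null;
    }
  });
}
```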

Wi-Fi → cellular roam (mobile)

Same as Wi-Fi drop, but with one extra wrinkle for MeteredPeer: the per-peer ICE state often goes "disconnected" before the WS itself notices. The per-peer ICE-restart ladder kicks in:

RTCPeerConnection.iceConnectionState === "disconnected" for peer X.
SDK fires remote.state-change → "reconnecting" for peer X.
SDK calls pc.restartIce() with exponential backoff, up to 9 attempts.
Either ICE recovers (peer X back to "connected") or — if the WS itself drops — Layer 3 reconcile kicks in instead.

If the WS stays up, only that peer's state flickers; other peers are untouched.
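
Since only the affected peer's state changes, a per-peer indicator (for example a spinner on one video tile) is usually enough. A minimal sketch, assuming the peer-joined, peer-left and per-peer state-change events behave as shown in pattern 2; trackPeerStates and onChange are hypothetical names.

```javascript
// Hypothetical helper: keeps the latest state per remote peer and
// notifies the UI on every per-peer transition.
function trackPeerStates(peer, onChange) {
  const states = new Map(); // peerId -> latest state string

  peer.on("peer-joined", ({ peer: remote }) => {
    remote.on("state-change", ({ to }) => {
      states.set(remote.id, to);
      onChange(remote.id, to);
    });
  });

  peer.on("peer-left", ({ peer: remote }) => {
    states.delete(remote.id);
  });

  return states;
}
```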

Tab backgrounded (Chrome / mobile Safari throttling)

Backgrounded tabs get JS timers throttled aggressively. The SDK's inactivity watchdog (default 60 s) catches stuck connections:

Tab is backgrounded; the server keeps sending pings, but JS doesn't run to respond.
SDK's setTimeout-based watchdog eventually fires (slowly), or the server times out and closes (code 1001 or 1006).
When the user foregrounds the tab, the WS is dead. SDK opens a new one.
Normal reconnect flow.

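For most apps, this is fine: the reconnect happens within ~1 s of foregrounding. If you need faster recovery, listen for the visibilitychange event on document and force a reconnect proactively. In the snippet below, retryNow stands for the close() + fresh-instance handler from pattern 3: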

document.addEventListener("visibilitychange", () => {
  if (!document.hidden && peer.state === "reconnecting") {
    // Sometimes the throttled WS isn't actually dead yet, but we want
    // to move things along. Construct a fresh instance.
    void retryNow();
  }
});

(Most apps don't need this. The default behaviour works.)

Server deploy (rolling restart)

Server emits a going_away frame carrying a retryAfterMs hint.
SDK fires going-away event with { retryAfterMs } — informational.
Server closes with code 1001.
SDK fires disconnected({ code: 1001, willReconnect: true }).
SDK reconnects on its normal exponential backoff.
A new server instance accepts the connection. Welcome arrives.

Note: The current SDK does NOT automatically delay by retryAfterMs — it uses its normal backoff curve. The retryAfterMs hint exists so customers can opt in to spreading reconnects across a fleet during a deploy. If you want to honor it, listen for going-away yourself and stall your own retry:

let manualDelayMs = 0;
peer.on("going-away", ({ retryAfterMs }) => {
  manualDelayMs = retryAfterMs; // remember the server's hint for the next disconnect
});
peer.on("disconnected", ({ willReconnect }) => {
  if (willReconnect && manualDelayMs > 0) {
    // Disable auto-reconnect to take control, sleep for manualDelayMs, then reconnect.
    // (Most apps don't bother; normal backoff is fine for browser clients.)
    manualDelayMs = 0; // the hint applies to a single disconnect
  }
});

For browser-side apps this rarely matters — the normal backoff jitter already spreads reconnects across the fleet adequately. Long-running server-side processes that must coordinate with a deploy window may want to honor the hint.

JWT expires mid-session

Server sends close code 4002 (TokenExpired).
SDK fires disconnected({ code: 4002, willReconnect: true }).
SDK calls your tokenProvider() to get a fresh JWT.
Reconnect.

For this to work, tokenProvider() must never hand back an expired JWT: don't keep serving a cached token right up to its expiry. Either mint a new one on every call, or cache with a TTL shorter than the JWT's exp (e.g. mint with a 1 h expiry, cache for 50 minutes).
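
The tighter-TTL option can be sketched as a small wrapper. Everything here is an assumption for illustration: makeCachedTokenProvider is a hypothetical name, mint is your own async function that fetches a new JWT from your backend, and the sketch assumes tokenProvider may return a promise of the token.

```javascript
// Hypothetical wrapper: serves a cached JWT for cacheMs (default 50 min,
// matching a 1 h expiry), then mints a new one. `now` is injectable so the
// expiry logic can be checked without waiting.
function makeCachedTokenProvider(mint, cacheMs = 50 * 60_000, now = Date.now) {
  let cached = null; // { token, fetchedAt }

  return async function tokenProvider() {
    if (cached && now() - cached.fetchedAt < cacheMs) {
      return cached.token; // still comfortably inside the JWT's lifetime
    }
    const token = await mint();
    cached = { token, fetchedAt: now() };
    return token;
  };
}
```

If mint throws, nothing is cached and the error propagates to the caller, so the SDK's own failure counting for tokenProvider still sees it.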

Concurrent-connection cap hit

SDK tries to reconnect; server rejects with 4010.
SDK enforces a ≥30 s backoff floor.
Either the cap clears (another connection freed up) or the SDK gives up after maxAttempts.

This usually means a connection leak somewhere — old MeteredPeer instances you forgot to close(). Check your state === "closed" cleanup. The dashboard's "Current concurrent connections" graph shows you the number.
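
One way to make leaks harder to write is to route every instance through a small registry and close them all on teardown. This is a sketch under assumptions: createPeerRegistry is a hypothetical helper, and it relies on close() returning a promise and on state-change reporting "closed", as the snippets and pitfalls on this page suggest.

```javascript
// Hypothetical registry: track() every MeteredPeer you construct, and call
// closeAll() on teardown (page hide, route change, process exit).
function createPeerRegistry() {
  const live = new Set();

  return {
    track(peer) {
      live.add(peer);
      // Drop instances that close on their own so the set stays accurate.
      peer.on("state-change", ({ to }) => {
        if (to === "closed") live.delete(peer);
      });
      return peer;
    },
    async closeAll() {
      const peers = [...live];
      live.clear();
      await Promise.all(peers.map((p) => p.close()));
    },
    get size() {
      return live.size;
    },
  };
}
```

Comparing registry.size with the dashboard graph is a quick way to tell whether the leak is in your code or elsewhere.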

Testing your reconnect logic

You can't validate reconnect behaviour by just running the happy path. The table below covers the failure modes worth testing before going to production.

Test	How to simulate	What to verify
Brief network drop	DevTools → Network → "Offline" for 5 s	Banner appears, then hides; messages buffered locally arrive after reconnect
Long network drop	"Offline" for 60 s	Stuck-reconnect banner appears at ~30 s; recovers cleanly when network restored
Server-initiated close	`DELETE /v1/peers/:id` from another tab using the REST API	Terminal close code 4020 surfaces; no retry; "kicked" UI shown
JWT expiry	Mint a JWT with `exp` 60 s out and wait	Code 4002 disconnect, `tokenProvider()` called, reconnect succeeds with fresh token
Token revocation	Revoke the key from the dashboard while connected	Disconnects + reconnect attempt fails with 4001; "log in again" UI shown
DataChannel reopen	If using P2P DCs: force a reconnect, verify DC traffic resumes	Your `state-change` → `"connected"` handler fires, opens a fresh DC, data resumes
Concurrent cap	Open more connections than your plan allows	Code 4010, slow backoff, eventually gives up or one frees
Tab backgrounded	Switch tabs for 5 minutes	On foreground, reconnects within seconds; messages from while-backgrounded delivered
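
The manual tests above exercise the network path. Your own UI logic can additionally be unit-tested without a network by driving it with synthetic state-change events. The sketch below restates the delayed banner from pattern 1 with injectable timers; createBannerController and its option names are hypothetical.

```javascript
// Hypothetical, testable form of the delayed banner from pattern 1.
function createBannerController({
  show,
  hide,
  delayMs = 1000,
  setTimer = setTimeout,
  clearTimer = clearTimeout,
}) {
  let pending = null;
  return function onStateChange({ to }) {
    if (pending !== null) {
      clearTimer(pending);
      pending = null;
    }
    if (to === "reconnecting") {
      pending = setTimer(() => {
        pending = null;
        show();
      }, delayMs);
    } else {
      hide();
    }
  };
}

// Drive it with synthetic events and a manual timer instead of a network.
const calls = [];
let fire = null;
const onStateChange = createBannerController({
  show: () => calls.push("show"),
  hide: () => calls.push("hide"),
  setTimer: (fn) => { fire = fn; return 1; },
  clearTimer: () => { fire = null; },
});
onStateChange({ to: "reconnecting" });
onStateChange({ to: "connected" }); // recovered before the grace period ended
// calls is now ["hide"]: the banner never appeared on screen
```

In the app itself you would pass the returned function straight to peer.on("state-change", ...).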

Common pitfalls

Reconnect banner that flickers. Caused by not adding a 1 s grace period before showing it (see pattern 1).
DataChannel silently stops carrying data after a reconnect. Caused by holding RTCDataChannel references across the drop. Use the state-change → "connected" pattern from pattern 2.
Stale remote.pc references. Same root cause as the previous pitfall: remote.pc is a different object after a reconcile. Either re-read it every time you need it, or wire to state-change.
tokenProvider() returning a stale (cached) JWT. If your mint endpoint caches and you refresh by reconnecting, you'll cycle: 4002 close → tokenProvider → stale token → 4002 again. Make sure the mint endpoint returns a fresh token.
Adding state listeners on the underlying pc directly. Listeners added via remote.pc.addEventListener(...) go away when the PC is swapped on reconcile. Listen on the RemotePeer wrapper (remote.on(...)) instead — those persist.
Calling peer.send(...) during "reconnecting". Rejects with MeteredPeerSendError("not_joined") or similar. Either gate sends on state === "joined", or queue them yourself and flush on state-change → "joined".
Trying to rejoin with the same MeteredPeer after close(). Terminal. Construct a fresh instance.
Not handling state === "closed". This is the SDK telling you "I'm giving up." If your code just waits forever for the next state-change, you're stuck. Show the user a retry button.
autoResubscribe: false + forgetting to resubscribe. Default is true — leave it on unless you have a specific reason. Setting it false and not handling the reconnect-resubscribe yourself means you silently miss messages on every subsequent connection. (On MeteredPeer this is now enforced: passing autoResubscribe: false throws a TypeError at construction in SDK ≥ 1.1, because channel recovery depends on the re-subscribe. The pitfall only applies to direct SignallingClient users.)
Reconnect on every page load. This is fine in browsers (transient connection per tab is normal), but if you're running a long-lived Node process, set reconnect.maxAttempts: Infinity so transient cloud-provider blips don't take you offline permanently.
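
For the peer.send(...) pitfall, the "queue them yourself" option can look like the sketch below. It is a minimal illustration, not an SDK feature: createSendQueue is a hypothetical name, the queue is unbounded, and it relies only on peer.state, peer.send and the state-change event as described on this page.

```javascript
// Hypothetical wrapper: sends immediately while joined, otherwise queues
// and flushes in order on the next state-change -> "joined".
function createSendQueue(peer) {
  const pending = [];

  peer.on("state-change", ({ to }) => {
    if (to !== "joined") return;
    while (pending.length > 0) {
      peer.send(pending.shift());
    }
  });

  return function send(data) {
    if (peer.state === "joined") {
      peer.send(data);
    } else {
      pending.push(data); // held until the peer is joined again
    }
  };
}
```

For high-frequency data such as game ticks, consider capping the queue or keeping only the latest message, since a long outage would otherwise replay stale updates.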
