#17588: fix(relay): survive WS disconnects and MV3 worker restarts

by Unayung open 2026-02-15 23:33 View on GitHub →

size: M

## Summary The Browser Relay extension frequently loses connection and fails to recover, requiring manual re-attach. This PR adds 5 fixes to make the relay resilient to WebSocket drops, MV3 service worker termination, and tab lifecycle events. Related issues: #15099, #15989, #11142 ## Problem The current `background.js` has several failure modes: 1. **WebSocket drop → debugger detach**: When the relay WebSocket closes (e.g., OpenClaw restart, network blip), `onRelayClosed()` immediately detaches all debugger sessions and clears state. The user must manually click the toolbar icon again to re-attach. 2. **No auto-reconnect**: After a WS disconnect, the extension makes no attempt to reconnect. It silently goes dead. 3. **MV3 service worker death**: Chrome/Edge can terminate the service worker at any time. All in-memory state (tabs, sessions) is lost. 4. **Tab close/replace not handled**: Stale entries remain in the maps. 5. **Service worker termination**: Without periodic activity, Chrome terminates the MV3 service worker after ~30s of inactivity. ## Fixes | # | Fix | Description | |---|-----|-------------| | 1 | **Don't detach on WS drop** | Keep debugger attached, show reconnecting badge, re-announce on reconnect | | 2 | **Auto-reconnect** | Exponential backoff (1s → 30s cap) with jitter, re-announce all tabs on success | | 3 | **Session storage persistence** | Save/restore state via `chrome.storage.session` for MV3 worker restarts, with debugger re-attach | | 4 | **Tab lifecycle cleanup** | `onRemoved` / `onReplaced` listeners to clean up stale entries | | 5 | **Keepalive alarm** | `chrome.alarms` every 4 min to prevent worker termination + detect stale debuggers | ### manifest.json - Added `"alarms"` to permissions array (required for Fix #5) ## Testing Tested on Edge (Chromium-based) with an automated pipeline running via cron at 3:00 AM and 3:00 PM daily: - **12+ hours continuous uptime** without disconnects - Survived multiple OpenClaw gateway restarts (auto-reconnect worked) - Survived Edge Sleeping Tabs (keepalive prevented worker termination) - Cron jobs successfully connected to the relay without manual intervention - Badge correctly shows ⏳ during reconnect, ✅ on recovery  <h3>Greptile Summary</h3> This PR makes the Browser Relay extension resilient to WebSocket disconnects, MV3 service worker termination, and tab lifecycle events by adding five fixes: keeping debugger sessions alive across WS drops, auto-reconnect with exponential backoff, session storage persistence, tab lifecycle cleanup, and a keepalive alarm. - **Fix #1 (Don't detach on WS drop)**: `onRelayClosed` no longer tears down debugger sessions — it updates badges and preserves in-memory state for reconnection. - **Fix #2 (Auto-reconnect)**: `scheduleReconnect` uses exponential backoff (1s–30s cap) with jitter. On success, it re-validates and re-announces all tabs with thorough per-tab error handling. - **Fix #3 (Session storage persistence)**: State is persisted to `chrome.storage.session` after attach/detach and restored on service worker startup, including debugger re-attach if needed. - **Fix #4 (Tab lifecycle cleanup)**: `onRemoved` and `onReplaced` listeners clean up stale entries. - **Fix #5 (Keepalive alarm)**: 4-minute `chrome.alarms` interval pings debuggers and triggers state restore if a debugger is lost. - **Issue**: The `restoreState` re-announce loop (lines 313–330) lacks per-tab error handling — a single tab failure aborts re-announcement for all remaining tabs, unlike the analogous loop in `scheduleReconnect`. - **Note**: `childSessionToTab` is not persisted, so child sessions (iframes, service workers) will be lost across service worker restarts. New child session events will re-populate the map, but the relay server may hold stale references in the interim. <h3>Confidence Score: 3/5</h3> - This PR introduces significant resilience improvements but has several race conditions and error-handling gaps that could cause unexpected behavior under concurrent failure scenarios. - The core logic is well-structured and addresses real failure modes effectively. However, the re-announce loop in restoreState lacks per-tab error handling (new finding), and previous review threads identified race conditions in onRelayClosed (socket identity), reconnect guard disarming, and keepalive-triggered restoreState clobbering — all of which remain unresolved. The manifest change is minimal and correct. - `assets/chrome-extension/background.js` — the `restoreState` function (lines 310–334) and the race conditions noted in previous review threads <sub>Last reviewed commit: 7ab237e</sub>