
#22367: fix(whatsapp): prevent permanent listener loss after abort during reconnect backoff

by mcinteerj · open · 2026-02-21 02:39
channel: whatsapp-web · size: XS · experienced-contributor
## Summary

- **Problem:** When `abortSignal` fires during the reconnect backoff `sleep()` in `monitorWebChannel`, the `catch` block unconditionally `break`s out of the reconnect loop. This leaves `activeWebListener` permanently `null`, even though the WhatsApp connection itself recovers.
- **Why it matters:** Any action that calls `requireActiveWebListener()` (e.g. `react`, `send`) throws while the listener is `null`. The gateway *appears* healthy (channels show as started), but web actions silently fail until a full restart.
- **What changed:** The `catch` block after `sleep(delay, abortSignal)` now checks `stopRequested()` before breaking. If the abort was not a deliberate stop, `continue` keeps the reconnect loop alive.
- **What did NOT change:** Deliberate shutdown (abort + `stopRequested() === true`) still exits cleanly. The outer channel manager restart logic in `server-channels.ts` is untouched.

> **Note to maintainers:** This is a minimal, targeted fix (Option C). A more robust long-term approach (Option B) would be to decouple the inner reconnect loop's abort signal from the outer one: for example, derive a child `AbortController` with an independent lifecycle, so the channel manager can abort the current connection attempt without permanently disabling the monitor's reconnect capability. Worth considering in a future refactor.

## Change Type (select all)

- [x] Bug fix

## Scope (select all touched areas)

- [x] Integrations

## Linked Issue/PR

- Related #99

## User-visible / Behavior Changes

WhatsApp web actions (send, react, poll, etc.) no longer fail permanently after a watchdog-triggered reconnect. Previously, recovery required a full gateway restart.

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`

## Repro + Verification

### Environment

- OS: Ubuntu 24.04 (ARM64, OCI)
- Runtime/container: Node.js, OpenClaw gateway
- Integration/channel: WhatsApp Web

### Steps

1. Start the gateway with the WhatsApp channel enabled.
2. Wait for the watchdog timeout (30 minutes with no messages) or trigger a reconnect.
3. If the abort signal fires during the backoff sleep, the listener is permanently lost.
4. Attempt `message react`; it fails with "No active WhatsApp Web listener".

### Expected

- The reconnect loop continues, a new listener is established, and reactions work.

### Actual (before fix)

- The reconnect loop exits, `activeWebListener` stays `null`, and reactions fail permanently.

## Evidence

- [x] Trace/log snippets: reproduced via production gateway logs showing the listener as `null` after a watchdog reconnect

## Human Verification (required)

- Verified scenarios: Code-path analysis confirmed that the unconditional `break` exits the loop; with the fix, `continue` preserves the loop when `stopRequested()` is false.
- Edge cases checked: Deliberate shutdown still breaks cleanly; the `sigintStop` path is unaffected; the max-reconnect-attempts logic is unaffected.
- What you did **not** verify: Full e2e test with a real WhatsApp connection (will verify on a local deployment).

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`

## Failure Recovery (if this breaks)

- How to disable/revert: Revert the single commit and restart the gateway.
- Known bad symptoms: If `continue` somehow caused an infinite loop (unlikely, since both `stopRequested()` and `maxAttempts` are checked at the top of the loop), the monitor would spin on reconnect attempts.

## Risks and Mitigations

- Risk: `continue` skips the rest of the loop body after the sleep and jumps back to the `stopRequested()` check at the top. If the abort signal is permanently set, `stopRequested()` returns true and the loop breaks cleanly.
- Mitigation: The existing `stopRequested()` check at the top of the loop and the `maxAttempts` guard both prevent infinite loops.
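To make the control-flow change concrete, here is a simplified, hypothetical model of the reconnect loop. It is not the actual `monitorWebChannel` code: `runReconnectLoop`, `LoopOptions`, `connect`, `backoffMs`, and `legacyBreakOnAbort` are illustrative names, and the abortable `sleep` is a local stand-in for the real helper. The model only captures the difference this PR describes: after an aborted backoff sleep, the loop exits only when `stopRequested()` is true, while the guards at the top of the loop still bound it.

```typescript
// Hypothetical, simplified model of the WhatsApp Web reconnect loop.
// Names are illustrative; this is not the real monitorWebChannel code.

type Listener = { id: number };

interface LoopOptions {
  abortSignal: AbortSignal;
  stopRequested: () => boolean; // true only for a deliberate shutdown
  maxAttempts: number;
  backoffMs: (attempt: number) => number;
  connect: (attempt: number) => Promise<Listener | null>; // null = attempt failed
  legacyBreakOnAbort?: boolean; // true reproduces the pre-fix behavior
}

// Abortable sleep: rejects if the signal is (or becomes) aborted during the wait.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

async function runReconnectLoop(opts: LoopOptions): Promise<Listener | null> {
  let activeWebListener: Listener | null = null;
  for (let attempt = 1; ; attempt++) {
    // Guards at the top of the loop: deliberate stop and attempt budget.
    if (opts.stopRequested() || attempt > opts.maxAttempts) break;

    activeWebListener = await opts.connect(attempt);
    // Connected: the real monitor would now wait for a disconnect; the model just returns.
    if (activeWebListener) break;

    try {
      await sleep(opts.backoffMs(attempt), opts.abortSignal);
    } catch {
      // Pre-fix: any abort during backoff exits, leaving the listener null.
      // Post-fix: exit only on a deliberate stop; otherwise retry.
      if (opts.legacyBreakOnAbort || opts.stopRequested()) break;
      continue; // mirrors the PR's `continue` (nothing follows it in this model)
    }
  }
  return activeWebListener;
}
```

In this sketch an `AbortSignal` cannot be reset once aborted, so after a non-deliberate abort every later `sleep()` on the same signal rejects immediately and retries run without backoff, bounded only by `maxAttempts`.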
<h3>Greptile Summary</h3>

Fixed permanent listener loss when `abortSignal` fires during the reconnect backoff sleep in `monitorWebChannel`. Previously, any abort during the backoff unconditionally broke out of the reconnect loop, leaving `activeWebListener` as `null` even after the WhatsApp connection recovered. The fix adds a `stopRequested()` check in the `catch` block: it breaks only on deliberate shutdown and otherwise continues the loop to re-establish the listener.

- Prevents permanent reaction failures after watchdog-triggered reconnects
- Preserves clean shutdown behavior when the abort is deliberate
- Existing guards (`stopRequested()` at the top of the loop, the `maxAttempts` limit) prevent infinite loops

<h3>Confidence Score: 5/5</h3>

- Safe to merge: a minimal, targeted fix with proper safeguards
- The change is a 7-line surgical fix to a specific edge case with clear logic. The existing loop guards (the `stopRequested()` check at the top of the loop and the `maxAttempts` limit) prevent potential infinite loops. The fix preserves the original shutdown behavior while resolving the permanent listener loss. No changes to external interfaces or side effects.
- No files require special attention

<sub>Last reviewed commit: e851ca8</sub>
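The maintainer note in the PR description proposes Option B as a longer-term alternative: give each connection attempt its own child `AbortController`, linked to a long-lived parent signal. The following is a minimal sketch of that idea, assuming the monitor owns a single "lifetime" signal that is aborted only on deliberate shutdown; `createChildController` and `WebMonitorSignals` are hypothetical names, not existing OpenClaw code.

```typescript
// Hypothetical sketch of "Option B": per-attempt child AbortControllers linked
// to a long-lived parent ("lifetime") signal. Names are illustrative only.

// The child aborts when the parent does, but can also be aborted on its own.
function createChildController(parent: AbortSignal): AbortController {
  const child = new AbortController();
  if (parent.aborted) {
    child.abort();
    return child;
  }
  const onParentAbort = () => child.abort();
  parent.addEventListener("abort", onParentAbort, { once: true });
  // If the child is aborted first, detach from the parent so listeners don't pile up.
  child.signal.addEventListener(
    "abort",
    () => parent.removeEventListener("abort", onParentAbort),
    { once: true },
  );
  return child;
}

class WebMonitorSignals {
  private readonly lifetime: AbortSignal;
  private current: AbortController | null = null;

  constructor(lifetime: AbortSignal) {
    this.lifetime = lifetime;
  }

  // Called at the start of each connection attempt; supersedes any previous one.
  beginAttempt(): AbortSignal {
    this.current?.abort();
    this.current = createChildController(this.lifetime);
    return this.current.signal;
  }

  // What a watchdog or channel manager could call to force a reconnect.
  abortCurrentAttempt(): void {
    this.current?.abort();
  }

  // Only aborting the lifetime signal counts as a deliberate stop.
  stopRequested(): boolean {
    return this.lifetime.aborted;
  }
}
```

With this split, the reconnect loop's exit condition reduces to `stopRequested()`, which only the lifetime signal controls, while a watchdog can abort individual attempts freely. On recent Node.js versions, `AbortSignal.any()` could replace the manual parent/child linking.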
