#9727: fix(whatsapp): retry reconnect loop on initial connection failure
channel: whatsapp-web
Cluster: WhatsApp Connection Stability Fixes
## Summary
- Retry initial WhatsApp Web listener startup failures in `monitorWebChannel` using the existing reconnect backoff instead of exiting.
- Update reconnect status/logging for startup failures and respect `maxAttempts`.
- Add a regression test that simulates an initial `ENOTFOUND` and verifies the reconnect loop retries.
## Why
- DNS/network errors during the very first WhatsApp connection (for example `ENOTFOUND web.whatsapp.com`) previously escaped the reconnect loop, causing the gateway to stop. This change makes initial connection failures behave like later reconnects and fixes #13506.
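The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual `monitorWebChannel` code: `startWithRetry`, `ReconnectPolicy`, `computeBackoffMs`, and the injected `sleep`/`log` callbacks are hypothetical names, and the backoff formula (exponential, capped, no jitter) is an assumption about what "the existing reconnect backoff" looks like.

```typescript
// Hypothetical types; the real listener and policy shapes may differ.
type Listener = { close(): Promise<void> };
type ListenerFactory = () => Promise<Listener>;

interface ReconnectPolicy {
  initialMs: number;
  maxMs: number;
  factor: number;
  maxAttempts: number; // in this sketch, 0 means "no limit"
}

// Exponential backoff capped at maxMs (assumed formula, no jitter).
function computeBackoffMs(policy: ReconnectPolicy, attempt: number): number {
  const exponent = Math.max(0, attempt - 1);
  const raw = policy.initialMs * Math.pow(policy.factor, exponent);
  return Math.min(policy.maxMs, Math.round(raw));
}

async function startWithRetry(
  listenerFactory: ListenerFactory,
  policy: ReconnectPolicy,
  sleep: (ms: number) => Promise<void>,
  signal?: { aborted: boolean },
  log: (message: string) => void = () => {},
): Promise<Listener | null> {
  let reconnectAttempts = 0;
  while (!signal?.aborted) {
    try {
      // Before the fix, a rejection here escaped the loop and stopped the gateway.
      return await listenerFactory();
    } catch (err) {
      reconnectAttempts += 1;
      if (policy.maxAttempts > 0 && reconnectAttempts >= policy.maxAttempts) {
        log(`giving up after ${reconnectAttempts} attempts: ${String(err)}`);
        return null;
      }
      log(`failed to establish initial connection; will retry: ${String(err)}`);
      await sleep(computeBackoffMs(policy, reconnectAttempts));
    }
  }
  return null; // aborted before a connection was established
}
```

Passing `sleep` in as a parameter is a design choice for the sketch: it lets a test substitute an instant resolver so the retry path can be exercised without real delays.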
## Log Evidence
- Original bug (2026-02-05 07:16:14 UTC, production OpenClaw 2026.2.3): reconnect loop did not engage; channel remained dead until manual restart at 13:06.
```text
{"error":"Error: getaddrinfo ENOTFOUND web.whatsapp.com"},"WebSocket error"
path: "opt/homebrew/lib/node_modules/openclaw/dist/web/session.js:117"
time: "2026-02-05T07:16:14.679Z"
```
- Fix working (2026-02-05 15:01:56 UTC, dev build with fix): new "will retry" log indicates the initial failure is captured and the reconnect loop continues.
```text
{"error":"ENOTFOUND web.whatsapp.com","reconnectAttempts":0},"web reconnect: failed to establish initial connection; will retry"
path: "/Users/lsantos/Projects/openclaw/src/web/auto-reply/monitor.ts:214"
time: "2026-02-05T15:01:56.442Z"
```
## Testing
- `pnpm vitest run --config vitest.unit.config.ts "src/web/auto-reply.reconnects"` (1 test passed in 17ms)
- New test: `src/web/auto-reply.reconnects-after-initial-connection-failure.test.ts` uses a mocked listenerFactory that throws `ENOTFOUND` on the first attempt, asserts a second attempt happens without propagating the error, then aborts and closes cleanly.
- `pnpm build && pnpm check && pnpm test`
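For readers who want the shape of the regression test without opening the file, here is a framework-free sketch of the test double it describes. The names `makeFlakyFactory` and `checkRetryAfterInitialFailure` are hypothetical, and the inline retry loop is a stand-in for the real reconnect loop, so treat this as an illustration of the idea rather than the test that ships in the PR.

```typescript
type Listener = { close(): Promise<void> };

// A listener factory that also records how often it was called and closed.
interface FlakyFactory {
  (): Promise<Listener>;
  calls: number;
  closed: number;
}

// Rejects with an ENOTFOUND-style error for the first `failures` calls,
// then resolves with a listener whose close() is counted.
function makeFlakyFactory(failures: number): FlakyFactory {
  const factory = (async () => {
    factory.calls += 1;
    if (factory.calls <= failures) {
      throw Object.assign(new Error("getaddrinfo ENOTFOUND web.whatsapp.com"), {
        code: "ENOTFOUND",
      });
    }
    return {
      close: async () => {
        factory.closed += 1;
      },
    };
  }) as FlakyFactory;
  factory.calls = 0;
  factory.closed = 0;
  return factory;
}

// Stand-in for the reconnect loop: retry until the factory succeeds,
// then close the listener, and report what happened.
async function checkRetryAfterInitialFailure(): Promise<{ calls: number; closed: number }> {
  const factory = makeFlakyFactory(1);
  let listener: Listener | null = null;
  while (listener === null) {
    try {
      listener = await factory();
    } catch {
      // The initial failure is swallowed here instead of propagating.
    }
  }
  await listener.close();
  return { calls: factory.calls, closed: factory.closed };
}
```

A test runner would await `checkRetryAfterInitialFailure()` and assert that `calls` is 2 and `closed` is 1, which mirrors the "second attempt happens, then closes cleanly" assertion described above.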
## AI Assistance
- AI-assisted: yes (Codex (gpt-5.2-codex xhigh) full-auto).
- Collaboration notes:
  - Claude (Opus 4.5) analyzed logs and identified the root cause in `monitorWebChannel` (the initial `await listenerFactory()` call lacked a try/catch).
  - Codex CLI reviewed the root cause, implemented the fix, and wrote the test.
  - Claude reviewed the fix and confirmed it matched the root-cause analysis.
- Original prompt to Codex: "Fix the WhatsApp DNS reconnect bug. The issue is in src/web/auto-reply/monitor.ts around line 192 - the await listenerFactory() call needs try/catch to handle initial connection failures and continue the retry loop with backoff."
- Understanding confirmation: I understand this change catches listener startup errors, records the failure, increments reconnect attempts, waits with backoff, and retries until `maxAttempts` is reached; the new test asserts that a retry happens after an initial `ENOTFOUND`.
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR updates the WhatsApp Web reconnect logic so that failures during the *initial* listener startup are handled by the same reconnect/backoff loop as later disconnects, rather than escaping and stopping the gateway. Concretely, `monitorWebChannel` now wraps the initial `listenerFactory`/`monitorWebInbox` startup in a `try/catch`, records the error in channel status, increments `reconnectAttempts`, applies `maxAttempts`, waits using the configured backoff, and retries.
It also adds a regression test that simulates a first-attempt DNS failure (`ENOTFOUND`) from the listener factory and asserts that the reconnect loop performs a second startup attempt without propagating the initial error, then aborts cleanly.
<h3>Confidence Score: 4/5</h3>
- This PR is close to merge-ready; the runtime fix looks correct, but the new regression test is likely to be flaky in CI as written.
- The reconnect-loop change is localized and follows the existing backoff/maxAttempts flow. The main concern is the test’s dependence on a hard 200ms wall-clock polling loop with real timers, which can intermittently fail under CI load despite correct behavior.
- File needing the closest review: `src/web/auto-reply.reconnects-after-initial-connection-failure.test.ts`
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
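One possible way to address the flakiness concern raised in the review is to have the test wait on an explicit signal instead of polling against a wall-clock deadline. The sketch below is an assumption about how that could look, not something this PR is known to contain: `createDeferred` and `signalOnNthCall` are hypothetical helpers.

```typescript
interface Deferred<T> {
  promise: Promise<T>;
  resolve(value: T): void;
}

// A promise whose resolution is controlled from outside.
function createDeferred<T>(): Deferred<T> {
  let resolve!: (value: T) => void;
  const promise = new Promise<T>((res) => {
    resolve = res;
  });
  return { promise, resolve };
}

// Wraps an async function and exposes a promise that resolves
// as soon as the n-th call to the wrapper begins.
function signalOnNthCall<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  n: number,
) {
  const reached = createDeferred<number>();
  let calls = 0;
  const wrapped = (...args: A): Promise<R> => {
    calls += 1;
    if (calls === n) {
      reached.resolve(calls);
    }
    return fn(...args);
  };
  return { wrapped, reached: reached.promise, callCount: () => calls };
}
```

A test could pass `wrapped` as the listener factory and `await reached` to learn that the second startup attempt has begun. Combined with an injected or faked sleep for the backoff delay, this would remove the dependence on a fixed 200ms window.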
Most Similar PRs
- #17487: fix: WhatsApp connection stability - continue reconnection after ma... (by MisterGuy420 · 2026-02-15) · 88.1%
- #9515: fix(web): retry WhatsApp 515 restart up to 3 times with delay (by Sebachowa · 2026-02-05) · 83.8%
- #22367: fix(whatsapp): prevent permanent listener loss after abort during r... (by mcinteerj · 2026-02-21) · 81.2%
- #22143: Fix memory leak in WhatsApp channel reconnection loop (by lancejames221b · 2026-02-20) · 79.2%
- #16923: fix(web): resolve stale socket race condition in WhatsApp auto-reply (by dorukardahan · 2026-02-15) · 79.0%
- #3071: fix: WhatsApp 515 error retry not triggering (by rabsef-bicrym · 2026-01-28) · 78.7%
- #23134: fix(gateway): skip auto-restart for webhook channels that resolve i... (by puneet1409 · 2026-02-22) · 77.6%
- #19303: Fix WhatsApp internal error leakage + cron.run timeout defaults (by koala73 · 2026-02-17) · 77.1%
- #20395: fix(googlechat): prevent infinite auto-restart and ambiguous-target... (by ggalmeida0 · 2026-02-18) · 76.8%
- #16628: feat(web): implement three-tier graduated retry strategy (by KrE80r · 2026-02-14) · 76.7%