#16923: fix(web): resolve stale socket race condition in WhatsApp auto-reply
channel: whatsapp-web
stale
size: S
Cluster: WhatsApp Connection Stability Fixes
## Problem
When a WhatsApp message triggers AI processing (which takes 5-60+ seconds), the Baileys socket may disconnect and reconnect during that time. The reply handler's closure captures the original socket reference, so when the AI response is ready, `sendWithRetry()` tries to send on the **dead** socket.
All 3 retry attempts (500ms apart, 1.5s total window) execute against the same stale socket. Since reconnection typically takes 2+ seconds, the retries are exhausted before the new socket is available. The message is lost silently.
## Root Cause
In `monitorWebInbox()` (`src/web/inbound/monitor.ts`), the `reply` and `sendMedia` closures capture `sock` at creation time. When `monitorWebChannel()` reconnects and creates a new socket, the handler still holds a reference to the old one.
```
monitorWebChannel loop:
sock_v1 created → handler captures sock_v1
┌─ message arrives → AI processing starts (30s)
│ sock_v1 dies (keepAlive timeout)
│ sock_v2 created (reconnection)
└─ AI done → handler sends via sock_v1 (DEAD) → 3 retries → LOST
```
## Fix
**Socket Getter Pattern**: Replace direct socket capture with a shared mutable reference (`socketRef`). The reconnect loop updates `socketRef.current` whenever a new socket is created. Reply closures dereference `socketRef.current` at send time, always getting the latest live socket.
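The difference between capturing the socket and dereferencing a shared reference can be sketched in isolation. This is a minimal illustration, not the code from `src/web/inbound/monitor.ts`: `FakeSocket`, `makeSocket`, `makeCapturedReply`, `makeRefReply`, and `simulateReconnect` are hypothetical names, and the fake socket stands in for a Baileys socket.

```typescript
// Hypothetical stand-in for a Baileys socket; only what the sketch needs.
interface FakeSocket {
  id: string;
  open: boolean;
  send(text: string): string;
}

// Shared mutable reference, shaped like the `socketRef` described above.
interface SocketRef {
  current: FakeSocket | null;
}

function makeSocket(id: string): FakeSocket {
  const sock: FakeSocket = {
    id,
    open: true,
    send: (text) => {
      if (!sock.open) throw new Error(`${id} is closed`);
      return `${id}:${text}`;
    },
  };
  return sock;
}

// Before: the closure holds one socket for its whole lifetime.
function makeCapturedReply(sock: FakeSocket): (text: string) => string {
  return (text) => sock.send(text);
}

// After: the closure reads socketRef.current every time it sends.
function makeRefReply(socketRef: SocketRef): (text: string) => string {
  return (text) => {
    const sock = socketRef.current;
    if (!sock) throw new Error("no live socket (reconnect in progress)");
    return sock.send(text);
  };
}

function simulateReconnect(): { captured: string; viaRef: string } {
  const sockV1 = makeSocket("sock_v1");
  const socketRef: SocketRef = { current: sockV1 };
  const capturedReply = makeCapturedReply(sockV1);
  const refReply = makeRefReply(socketRef);

  sockV1.open = false; // keepAlive timeout kills sock_v1
  socketRef.current = null; // reconnect gap
  socketRef.current = makeSocket("sock_v2"); // reconnect publishes sock_v2

  let captured = "";
  try {
    captured = capturedReply("hi");
  } catch (err) {
    captured = `failed: ${(err as Error).message}`;
  }
  return { captured, viaRef: refReply("hi") };
}
```

In this sketch the captured handler still fails against `sock_v1` after the reconnect, while the reference-based handler sends through `sock_v2` without being recreated.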
Additionally, `sendWithRetry()` gains disconnect-aware extended backoff: when a disconnect-class error is detected, the retry schedule escalates from 3×500ms (1.5s total) to 6 attempts with exponential backoff (1s, 2s, 4s, 8s, 16s, 32s; 63s total), giving reconnection ample time to complete.
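The escalation can be sketched as follows. This is an assumption-laden illustration rather than the PR's implementation: the `isDisconnectError` predicate and its message pattern are hypothetical, the `sleep` parameter is injected only so the sketch can run without real timers, and the sketch assumes a wait follows every failed attempt (which is what makes the totals come to 1.5s and 63s).

```typescript
type Sleep = (ms: number) => Promise<void>;

// Default schedule: 3 attempts, 500ms after each failure (1.5s total).
const DEFAULT_DELAYS_MS: number[] = [500, 500, 500];

// Extended schedule: 1s, 2s, 4s, 8s, 16s, 32s (63s total).
const DISCONNECT_DELAYS_MS: number[] = Array.from(
  { length: 6 },
  (_, i) => 1000 * 2 ** i,
);

// Hypothetical classifier; the real detection logic is not shown in the PR text.
function isDisconnectError(err: unknown): boolean {
  return err instanceof Error && /connection closed|disconnected/i.test(err.message);
}

async function sendWithRetry<T>(send: () => Promise<T>, sleep: Sleep): Promise<T> {
  let delays = DEFAULT_DELAYS_MS;
  let lastError: unknown = new Error("no attempts made");
  // delays.length is re-read each iteration, so escalating mid-loop
  // extends the attempt budget from 3 to 6.
  for (let attempt = 0; attempt < delays.length; attempt++) {
    try {
      return await send();
    } catch (err) {
      lastError = err;
      if (isDisconnectError(err)) delays = DISCONNECT_DELAYS_MS;
      await sleep(delays[attempt] ?? 0);
    }
  }
  throw lastError;
}
```

Once any attempt fails with a disconnect-class error, the remaining attempts follow the extended schedule; other errors keep the original three-attempt budget.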
**Key changes:**
- `monitorWebChannel()`: Create `socketRef = { current: null }`, update on reconnect, nullify during reconnect gap
- `monitorWebInbox()`: Accept optional `socketRef`, update with new socket, dereference in `reply`/`sendMedia`/`sendComposing` closures
- `sendWithRetry()`: Escalate to extended exponential backoff on disconnect-class errors
**63 insertions, 10 deletions across 3 files.**
## Why This Approach
| Approach | Invasiveness | Complexity | Addresses Root Cause |
|----------|-------------|------------|---------------------|
| **Socket getter (chosen)** | Low (3 files) | Simple | Yes |
| Outbound message queue | Medium (new data structure) | Queue mgmt, TTL, dedup | No (workaround) |
| Event-based reconnect wait | High (event system) | Race conditions, timeouts | Partially |
The socket getter pattern is the least invasive fix: it changes *how* the socket is referenced (by-reference instead of by-value) without adding new infrastructure.
## Backward Compatibility
- **No config changes**: No new user-facing configuration options
- **No API changes**: External behavior identical (messages delivered successfully)
- **Existing retry behavior preserved**: Non-disconnect errors still use 3×500ms
- **Optional parameter**: `socketRef` is optional — callers not passing it get existing behavior
- Lint: 0 warnings, 0 errors (oxlint --type-aware)
- TypeScript: 0 errors in modified files
## Testing
1. **Unit test**: Mock socket that throws disconnect error on first `sendMessage`, then update `socketRef.current` to a new mock, verify second attempt succeeds
2. **Integration test**: Start gateway, send message, kill socket during AI processing, verify reply delivered on reconnected socket
3. **Edge cases**: Socket null during reconnect gap (retries with backoff), multiple messages in-flight (all recover), permanent auth failure (throws after exhausting retries)
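The unit test in item 1 might look roughly like the sketch below. Everything here is hypothetical scaffolding: `MockSocket`, `sendViaRef`, and `replyRecoversAfterReconnect` are illustrative names, the mock `sendMessage` signature is simplified, and the `onRetry` callback stands in for the backoff wait so the test can swap sockets deterministically.

```typescript
// Hypothetical mocks; a real test would use the project's own fixtures.
interface MockSocket {
  sendMessage(jid: string, text: string): Promise<string>;
}

interface SocketRef {
  current: MockSocket | null;
}

const deadSocket: MockSocket = {
  sendMessage: async () => {
    throw new Error("Connection Closed");
  },
};

const liveSocket: MockSocket = {
  sendMessage: async (jid, text) => `delivered to ${jid}: ${text}`,
};

// Dereferences ref.current on every attempt, so a socket swapped in
// between attempts is picked up by the next one.
async function sendViaRef(
  ref: SocketRef,
  jid: string,
  text: string,
  maxAttempts: number,
  onRetry: (attempt: number) => Promise<void>,
): Promise<string> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const sock = ref.current;
      if (!sock) throw new Error("socket unavailable during reconnect");
      return await sock.sendMessage(jid, text);
    } catch (err) {
      lastError = err;
      await onRetry(attempt);
    }
  }
  throw lastError;
}

async function replyRecoversAfterReconnect(): Promise<{
  result: string;
  failedAttempts: number;
}> {
  const socketRef: SocketRef = { current: deadSocket };
  let failedAttempts = 0;
  const result = await sendViaRef(socketRef, "user@example", "hello", 3, async () => {
    failedAttempts += 1;
    socketRef.current = liveSocket; // simulate the reconnect loop publishing a new socket
  });
  return { result, failedAttempts };
}
```

The same helper covers the null-gap edge case: leaving `socketRef.current` as `null` for one attempt produces a retryable error instead of a crash.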
## Related Issues
- Closes #16918
- Related: #4956 (socket lifecycle management gap), #4362 (false positive delivery), #1862 (message loss during transitions), #15147 (unsynchronized delivery paths)
- Upstream: WhiskeySockets/Baileys#1963 (false positive delivery reports)
- Complements the write-ahead delivery queue from v2026.2.13 (which handles crash-recovery but not mid-session socket replacement)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes a race condition where WhatsApp auto-replies are lost when the Baileys socket disconnects during AI processing. The fix introduces a socket getter pattern where reply closures dereference a shared `socketRef.current` at send time instead of capturing the socket at creation time. When reconnection creates a new socket, in-flight retries automatically pick it up.
The retry mechanism also gains disconnect-aware exponential backoff, escalating from 3×500ms to 6 attempts with longer delays when disconnect errors are detected.
**Key improvements:**
- Socket reference pattern ensures replies always use the live socket after reconnection
- Extended retry window (up to ~63s) gives reconnection time to complete
- Null-check guards prevent crashes during reconnect gaps
- Minimal invasiveness (3 files, socket reference changes only)
- Previous thread concerns about null pointer access have been addressed in commit 9b44755
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge with minimal risk
- The fix addresses a real production issue with a well-tested pattern. The socket getter approach is clean and the null-check guards prevent crashes. Previous review concerns have been addressed. Score is 4 (not 5) due to one minor observation about the retry backoff calculation sequence.
- No files require special attention
<sub>Last reviewed commit: 9b44755</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #9727: fix(whatsapp): retry reconnect loop on initial connection failure (by luizlf · 2026-02-05 · 79.0%)
- #3071: fix: WhatsApp 515 error retry not triggering (by rabsef-bicrym · 2026-01-28 · 77.5%)
- #22143: Fix memory leak in WhatsApp channel reconnection loop (by lancejames221b · 2026-02-20 · 76.8%)
- #22367: fix(whatsapp): prevent permanent listener loss after abort during r... (by mcinteerj · 2026-02-21 · 76.4%)
- #9515: fix(web): retry WhatsApp 515 restart up to 3 times with delay (by Sebachowa · 2026-02-05 · 75.9%)
- #17487: fix: WhatsApp connection stability - continue reconnection after ma... (by MisterGuy420 · 2026-02-15 · 75.7%)
- #17326: fix(whatsapp): group composing indicator, echo prevention, and pres... (by globalcaos · 2026-02-15 · 75.5%)
- #22399: fix(web): use globalThis singleton for active-listener state (by mcinteerj · 2026-02-21 · 75.3%)
- #21463: fix(discord): prevent WebSocket death spiral + fix numeric channel ID… (by akropp · 2026-02-20 · 74.9%)
- #16655: fix(whatsapp): resolve reply-to sender E.164 for LID JIDs (have bot... (by mascarenhas · 2026-02-15 · 74.9%)