#16923: fix(web): resolve stale socket race condition in WhatsApp auto-reply

by dorukardahan open 2026-02-15 07:41
channel: whatsapp-web, stale, size: S
## Problem

When a WhatsApp message triggers AI processing (which takes 5-60+ seconds), the Baileys socket may disconnect and reconnect during that time. The reply handler's closure captures the original socket reference, so when the AI response is ready, `sendWithRetry()` tries to send on the **dead** socket.

All 3 retry attempts (500ms apart, 1.5s total window) execute against the same stale socket. Since reconnection typically takes 2+ seconds, the retries are exhausted before the new socket is available. The message is lost silently.

## Root Cause

In `monitorWebInbox()` (`src/web/inbound/monitor.ts`), the `reply` and `sendMedia` closures capture `sock` at creation time. When `monitorWebChannel()` reconnects and creates a new socket, the handler still holds a reference to the old one.

```
monitorWebChannel loop:
  sock_v1 created → handler captures sock_v1
  ┌─ message arrives → AI processing starts (30s)
  │  sock_v1 dies (keepAlive timeout)
  │  sock_v2 created (reconnection)
  └─ AI done → handler sends via sock_v1 (DEAD) → 3 retries → LOST
```

## Fix

**Socket Getter Pattern**: Replace direct socket capture with a shared mutable reference (`socketRef`). The reconnect loop updates `socketRef.current` whenever a new socket is created. Reply closures dereference `socketRef.current` at send time, always getting the latest live socket.

Additionally, `sendWithRetry()` gains disconnect-aware extended backoff: when a disconnect-class error is detected, the retry schedule escalates from 3×500ms (1.5s) to 6 attempts with exponential backoff (1s, 2s, 4s, 8s, 16s, 32s = ~63s total), giving reconnection ample time to complete.
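For reviewers who want the shape of the change without reading the diff, here is a minimal sketch of the socket getter pattern combined with disconnect-aware backoff. It is not the code from this PR: `SocketLike`, `SocketRef`, `Sleep`, `makeReply`, `isDisconnectError` and the error-message matching are hypothetical stand-ins, and the real Baileys socket and the real `sendWithRetry()` signature may differ. This sketch also waits only between attempts, so six attempts yield five waits; the exact schedule in the PR may differ.

```typescript
// Hypothetical minimal socket type; the real Baileys socket is much larger.
interface SocketLike {
  sendMessage(jid: string, content: { text: string }): Promise<void>;
}

// Shared mutable reference: the reconnect loop writes `current`, senders read it.
interface SocketRef {
  current: SocketLike | null;
}

// Injectable sleep so the retry schedule can be tested without real delays.
type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: disconnect-class errors are recognised by message text.
// The PR description does not say how detection actually works.
function isDisconnectError(err: unknown): boolean {
  return err instanceof Error && /closed|disconnect|no active socket/i.test(err.message);
}

async function sendWithRetry(
  send: () => Promise<void>,
  sleep: Sleep = realSleep,
): Promise<void> {
  let maxAttempts = 3; // default schedule: 3 attempts, 500ms apart
  let extended = false;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await send();
      return;
    } catch (err) {
      lastError = err;
      if (!extended && isDisconnectError(err)) {
        extended = true; // escalate once a disconnect is seen
        maxAttempts = 6;
      }
      if (attempt === maxAttempts) break;
      // Extended mode doubles the wait each attempt (1s, 2s, 4s, ...).
      await sleep(extended ? 1000 * 2 ** (attempt - 1) : 500);
    }
  }
  throw lastError;
}

// The reply closure holds the ref, not a socket, and dereferences it on
// every attempt, so a retry after reconnection uses the new socket.
function makeReply(socketRef: SocketRef, jid: string, sleep: Sleep = realSleep) {
  return (text: string): Promise<void> =>
    sendWithRetry(async () => {
      const sock = socketRef.current;
      if (!sock) throw new Error("no active socket (reconnecting)");
      await sock.sendMessage(jid, { text });
    }, sleep);
}
```

In this sketch the reconnect loop would set `socketRef.current = null` when the old socket dies and assign the new socket once it is open. A reply that starts against the dead socket then fails with a disconnect-class error, switches to the longer schedule, and succeeds on the first attempt made after the assignment.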
**Key changes:**

- `monitorWebChannel()`: Create `socketRef = { current: null }`, update on reconnect, nullify during reconnect gap
- `monitorWebInbox()`: Accept optional `socketRef`, update with new socket, dereference in `reply`/`sendMedia`/`sendComposing` closures
- `sendWithRetry()`: Escalate to extended exponential backoff on disconnect-class errors

**63 insertions, 10 deletions across 3 files.**

## Why This Approach

| Approach | Invasiveness | Complexity | Addresses Root Cause |
|----------|-------------|------------|---------------------|
| **Socket getter (chosen)** | Low (3 files) | Simple | Yes |
| Outbound message queue | Medium (new data structure) | Queue mgmt, TTL, dedup | No (workaround) |
| Event-based reconnect wait | High (event system) | Race conditions, timeouts | Partially |

The socket getter pattern is the most minimal fix: it changes *how* the socket is referenced (by-reference instead of by-value) without adding new infrastructure.

## Backward Compatibility

- **No config changes**: No new user-facing configuration options
- **No API changes**: External behavior identical (messages delivered successfully)
- **Existing retry behavior preserved**: Non-disconnect errors still use 3×500ms
- **Optional parameter**: `socketRef` is optional; callers not passing it get existing behavior
- Lint: 0 warnings, 0 errors (oxlint --type-aware)
- TypeScript: 0 errors in modified files

## Testing

1. **Unit test**: Mock socket that throws disconnect error on first `sendMessage`, then update `socketRef.current` to a new mock, verify second attempt succeeds
2. **Integration test**: Start gateway, send message, kill socket during AI processing, verify reply delivered on reconnected socket
3. **Edge cases**: Socket null during reconnect gap (retries with backoff), multiple messages in-flight (all recover), permanent auth failure (throws after exhausting retries)

## Related Issues

- Closes #16918
- Related: #4956 (socket lifecycle management gap), #4362 (false positive delivery), #1862 (message loss during transitions), #15147 (unsynchronized delivery paths)
- Upstream: WhiskeySockets/Baileys#1963 (false positive delivery reports)
- Complements the write-ahead delivery queue from v2026.2.13 (which handles crash-recovery but not mid-session socket replacement)

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR fixes a race condition where WhatsApp auto-replies are lost when the Baileys socket disconnects during AI processing. The fix introduces a socket getter pattern where reply closures dereference a shared `socketRef.current` at send time instead of capturing the socket at creation time. When reconnection creates a new socket, in-flight retries automatically pick it up. The retry mechanism also gains disconnect-aware exponential backoff, escalating from 3×500ms to 6 attempts with longer delays when disconnect errors are detected.

**Key improvements:**

- Socket reference pattern ensures replies always use the live socket after reconnection
- Extended retry window (up to ~63s) gives reconnection time to complete
- Null-check guards prevent crashes during reconnect gaps
- Minimal invasiveness (3 files, socket reference changes only)
- Previous thread concerns about null pointer access have been addressed in commit 9b44755

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge with minimal risk
- The fix addresses a real production issue with a well-tested pattern. The socket getter approach is clean and the null-check guards prevent crashes. Previous review concerns have been addressed. Score is 4 (not 5) due to one minor observation about the retry backoff calculation sequence.
- No files require special attention

<sub>Last reviewed commit: 9b44755</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
