
#5441: fix(android): resolve WebSocket handshake race condition (#1922)

by cortexuvula open 2026-01-31 14:01
docs app: android
## Summary

Fixes #1922: the Android app disconnects before completing the WebSocket handshake (`closed before connect` in gateway logs with code=1000, reason='bye').

## Root Cause

The connect flow relied on coordinating two separately dispatched coroutines:

1. **`onOpen`** → launches coroutine A → `awaitConnectNonce()` → `sendConnect()`
2. **`onMessage`** (connect.challenge) → launches coroutine B → completes `connectNonceDeferred`

Both coroutines are dispatched to `Dispatchers.IO` via `scope.launch`, with no guaranteed execution order. On some devices/conditions, coroutine A would fail or the dispatch timing would cause `closeQuietly()` to fire before the connect request was sent, sending `close(1000, "bye")` to the gateway.

This explains why:

- Gateway logs show `closed before connect` with `code=1000 reason=bye`
- The app never sends the `connect` request despite receiving `connect.challenge`
- The issue reproduces on LAN connections (~15-25ms connection lifetime)
- Two connections per retry cycle (one per operator/node session)

## Fix

Drive the entire connect flow from the `connect.challenge` event handler instead of from `onOpen`:

- **`onOpen`**: Now only logs + handles loopback fallback (sends connect with null nonce for localhost connections that skip the challenge)
- **`handleEvent(connect.challenge)`**: Directly launches `sendConnect(nonce)`, so no cross-coroutine deferred coordination is needed
- **Removed `awaitConnectNonce()`**: No longer necessary since the nonce is passed directly

Also adds `Log.d`/`Log.w` calls so connection failures are visible in logcat (previously errors in the `onOpen` catch block were silent).
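As a rough illustration of the flow described under Fix, here is a minimal sketch. It is a simplified, hypothetical model and not the actual `GatewaySession.kt` code: `HandshakeSketch`, the `sent` list, and the `isLoopback` flag are invented for the example, and coroutines, OkHttp, and logging are left out so the control flow stays visible.

```kotlin
// Hypothetical, simplified model of the handshake; not the real GatewaySession.
class HandshakeSketch(private val isLoopback: Boolean) {
    // Records outgoing frames in place of a real WebSocket send (assumption).
    val sent = mutableListOf<String>()

    // onOpen no longer waits for a nonce; it only covers the loopback case,
    // where the gateway is assumed to skip the challenge.
    fun onOpen() {
        if (isLoopback) sendConnect(nonce = null)
    }

    // The challenge handler passes the nonce straight to sendConnect,
    // so nothing has to be handed over between two coroutines.
    fun handleEvent(event: String, nonce: String?) {
        if (event == "connect.challenge") sendConnect(nonce)
    }

    private fun sendConnect(nonce: String?) {
        sent += "connect(nonce=$nonce)"
    }
}

fun main() {
    // LAN case: onOpen sends nothing, the challenge triggers the connect.
    val lan = HandshakeSketch(isLoopback = false)
    lan.onOpen()
    lan.handleEvent("connect.challenge", "abc123")
    println(lan.sent) // prints [connect(nonce=abc123)]

    // Loopback case: onOpen sends the connect with a null nonce.
    val local = HandshakeSketch(isLoopback = true)
    local.onOpen()
    println(local.sent) // prints [connect(nonce=null)]
}
```

In this model the nonce reaches `sendConnect` as a plain argument, which is the property the PR relies on: there is no deferred value whose completion order depends on dispatcher scheduling.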
## Testing

- LAN connection (non-loopback): `connect.challenge` triggers `sendConnect` directly
- Loopback connection: `onOpen` triggers `sendConnect` with null nonce (unchanged behavior)
- Multiple retry cycles: `connectNonceDeferred` guard prevents duplicate `sendConnect` calls

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This change refactors the Android gateway WebSocket connect handshake to avoid a coroutine dispatch race: `onOpen` no longer awaits a nonce and sends `connect`, and the `connect.challenge` event handler now directly calls `sendConnect(nonce)`.

The approach matches the root-cause description (eliminating cross-coroutine deferred coordination), and the added log lines should make failures more diagnosable in logcat.

Main thing to double-check is the “loopback fallback” behavior: `onOpen` currently sends `connect(null)` immediately (despite the comment saying it starts a fallback timer), and it’s not coordinated with the `connect.challenge`-driven connect path. That can lead to duplicate `connect` requests / double-completion depending on whether loopback connections can still receive a challenge on some gateways or configurations.

<h3>Confidence Score: 3/5</h3>

- This PR is likely safe to merge, but the loopback fallback path may still cause duplicate connect attempts in some configurations.
- The core change (driving connect from `connect.challenge`) addresses the described race, but `onOpen` now sends a loopback connect immediately without a real timeout and without sharing a single “connect started” guard with the challenge path. If loopback ever receives a challenge, you can end up with two concurrent `sendConnect` calls and potentially double-completion of the connect deferred.
- apps/android/app/src/main/java/ai/openclaw/android/gateway/GatewaySession.kt

<!-- /greptile_comment -->
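One possible way to address the review's concern about a shared “connect started” guard is sketched below. This is an assumption about how such a guard could look, not code from the PR: `GuardedHandshakeSketch`, `connectStarted`, and `sendConnectOnce` are hypothetical names, and the `sent` list again stands in for a real WebSocket send.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch: both connect paths share one atomic guard,
// so at most one connect request is sent per connection.
class GuardedHandshakeSketch {
    private val connectStarted = AtomicBoolean(false)

    // Stand-in for the socket; only the caller that wins the guard appends.
    val sent = mutableListOf<String>()

    // Loopback fallback path (would be called from onOpen).
    fun onOpenLoopback(): Boolean = sendConnectOnce(nonce = null)

    // Challenge path (would be called from the connect.challenge handler).
    fun onChallenge(nonce: String): Boolean = sendConnectOnce(nonce)

    // compareAndSet flips false -> true atomically, so exactly one caller
    // proceeds even if both paths run concurrently on different threads.
    private fun sendConnectOnce(nonce: String?): Boolean {
        if (!connectStarted.compareAndSet(false, true)) return false
        sent += "connect(nonce=$nonce)"
        return true
    }
}

fun main() {
    val session = GuardedHandshakeSketch()
    println(session.onOpenLoopback())      // prints true
    println(session.onChallenge("abc123")) // prints false
    println(session.sent)                  // prints [connect(nonce=null)]
}
```

Whichever path arrives first wins, and the other becomes a no-op. A real implementation would likely also need to reset the guard on each new connection attempt and decide whether a late challenge should override a null-nonce connect; both points are outside this sketch.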
