#5441: fix(android): resolve WebSocket handshake race condition (#1922)
docs
app: android
## Summary
Fixes #1922: the Android app disconnects before completing the WebSocket handshake (gateway logs show `closed before connect` with `code=1000 reason=bye`).
## Root Cause
The connect flow relied on coordinating two separately dispatched coroutines:
1. **`onOpen`** → launches coroutine A → `awaitConnectNonce()` → `sendConnect()`
2. **`onMessage`** (connect.challenge) → launches coroutine B → completes `connectNonceDeferred`
Both coroutines are dispatched to `Dispatchers.IO` via `scope.launch`, so their execution order is not guaranteed. On some devices, or under some conditions, coroutine A would fail, or dispatch timing would cause `closeQuietly()` to fire before the connect request was sent, which sends `close(1000, "bye")` to the gateway.
This explains why:
- Gateway logs show `closed before connect` with `code=1000 reason=bye`
- The app never sends the `connect` request despite receiving `connect.challenge`
- The issue reproduces on LAN connections, where each connection lives for only ~15-25 ms
- Each retry cycle produces two connections (one per operator/node session)
## Fix
Drive the entire connect flow from the `connect.challenge` event handler instead of from `onOpen`:
- **`onOpen`**: Now only logs and handles the loopback fallback (sends `connect` with a null nonce for localhost connections that skip the challenge)
- **`handleEvent(connect.challenge)`**: Directly launches `sendConnect(nonce)`, so no cross-coroutine deferred coordination is needed
- **Removed `awaitConnectNonce()`**: No longer necessary, since the nonce is passed directly
Also adds `Log.d`/`Log.w` calls so connection failures are visible in logcat (previously errors in the `onOpen` catch block were silent).
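The control flow described above can be sketched in a few lines of Kotlin. This is a simplified, synchronous model and not the code in `GatewaySession.kt`: the class name `ChallengeDrivenHandshake`, the `isLoopback` flag, and the `sendConnect` callback are assumptions made for illustration, and the real implementation launches `sendConnect` in a coroutine.

```kotlin
// Hypothetical model of the challenge-driven connect flow; all names are illustrative.
class ChallengeDrivenHandshake(
    private val isLoopback: Boolean,
    // Stand-in for the real sendConnect(nonce); receives null on the loopback path.
    private val sendConnect: (nonce: String?) -> Unit,
) {
    // onOpen no longer waits for a nonce. It only covers the loopback fallback.
    fun onOpen() {
        println("socket open (loopback=$isLoopback)")
        if (isLoopback) sendConnect(null)
    }

    // The handler that receives the nonce is the one that issues connect,
    // so no deferred has to be shared between two coroutines.
    fun handleEvent(event: String, nonce: String?) {
        if (event == "connect.challenge" && nonce != null) sendConnect(nonce)
    }
}

fun main() {
    val sent = mutableListOf<String?>()
    val lan = ChallengeDrivenHandshake(isLoopback = false) { sent.add(it) }
    lan.onOpen()                                   // sends nothing on a LAN connection
    lan.handleEvent("connect.challenge", "abc123") // sends connect with the nonce
    println(sent)
}
```

In this shape the nonce never crosses a coroutine boundary before `connect` is issued, which is the property the fix relies on.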
## Testing
- LAN connection (non-loopback): `connect.challenge` triggers `sendConnect` directly
- Loopback connection: `onOpen` triggers `sendConnect` with a null nonce (unchanged behavior)
- Multiple retry cycles: the `connectNonceDeferred` guard prevents duplicate `sendConnect` calls
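The last item depends on both entry points agreeing on whether a connect attempt has already started. A minimal sketch of one way to express such a guard follows, using an `AtomicBoolean` compare-and-set rather than the PR's `connectNonceDeferred` check. `ConnectGuard`, `tryStart`, and `sendConnectOnce` are hypothetical names, not identifiers from `GatewaySession.kt`.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical one-shot guard: only the first caller per connection attempt may send connect.
class ConnectGuard {
    private val started = AtomicBoolean(false)

    // Returns true for exactly one caller until reset() is called.
    fun tryStart(): Boolean = started.compareAndSet(false, true)

    // Call when a new socket is opened for the next retry cycle.
    fun reset() = started.set(false)
}

fun main() {
    val guard = ConnectGuard()
    val sent = mutableListOf<String?>()

    // Stand-in for sendConnect(nonce), gated by the shared guard.
    fun sendConnectOnce(nonce: String?) {
        if (guard.tryStart()) sent.add(nonce)
    }

    sendConnectOnce(null)     // loopback fallback fires first
    sendConnectOnce("abc123") // a late challenge is ignored
    println(sent)
}
```

Because `compareAndSet` is atomic, only one `connect` is sent regardless of which path runs first.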
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This change refactors the Android gateway WebSocket connect handshake to avoid a coroutine dispatch race: `onOpen` no longer awaits a nonce and sends `connect`, and the `connect.challenge` event handler now directly calls `sendConnect(nonce)`.
The approach matches the root-cause description (eliminating cross-coroutine deferred coordination), and the added log lines should make failures more diagnosable in logcat.
Main thing to double-check is the “loopback fallback” behavior: `onOpen` currently sends `connect(null)` immediately (despite the comment saying it starts a fallback timer), and it’s not coordinated with the `connect.challenge`-driven connect path. That can lead to duplicate `connect` requests / double-completion depending on whether loopback connections can still receive a challenge on some gateways or configurations.
<h3>Confidence Score: 3/5</h3>
- This PR is likely safe to merge, but the loopback fallback path may still cause duplicate connect attempts in some configurations.
- The core change (driving connect from `connect.challenge`) addresses the described race, but `onOpen` now sends a loopback connect immediately without a real timeout and without sharing a single “connect started” guard with the challenge path. If loopback ever receives a challenge, you can end up with two concurrent `sendConnect` calls and potentially double-completion of the connect deferred.
- apps/android/app/src/main/java/ai/openclaw/android/gateway/GatewaySession.kt
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->