#12983: fix(gateway): defer seq increment until after dropIfSlow filtering
gateway
stale
Cluster: Gateway Error Handling Improvements
## Problem
The dashboard shows recurring "event gap detected" errors because the broadcast
`seq` counter is incremented before the per-client send loop runs. When
`dropIfSlow` skips slow clients, the `seq` value is consumed but the frame is
never delivered to them, so those clients see non-contiguous sequence numbers.

Fixes #12895
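
To illustrate the failure mode, here is a minimal, hypothetical model of the pre-fix ordering. It is not the gateway code: `SimClient`, `buggySeq`, and `broadcastBuggy` are invented names, and the real `broadcastInternal()` handles more than this. It only shows how allocating `seq` before filtering produces a gap for a client that is temporarily slow.

```typescript
// Simplified model of the pre-fix behavior (all names are hypothetical).
type SimClient = { id: string; slow: boolean; received: number[] };

let buggySeq = 0;

function broadcastBuggy(clients: SimClient[], dropIfSlow: boolean): void {
  const frameSeq = ++buggySeq; // seq is consumed before any filtering happens
  for (const client of clients) {
    if (client.slow && dropIfSlow) continue; // frame skipped, seq already spent
    client.received.push(frameSeq);
  }
}

const dashboard: SimClient = { id: "dashboard", slow: false, received: [] };

broadcastBuggy([dashboard], true); // delivers seq 1
dashboard.slow = true;
broadcastBuggy([dashboard], true); // seq 2 is consumed, nothing is delivered
dashboard.slow = false;
broadcastBuggy([dashboard], true); // delivers seq 3

console.log(dashboard.received); // [1, 3]: the client observes a gap at 2
```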
## Solution
Restructure `broadcastInternal()` into a two-pass approach:
1. **Pass 1**: Collect eligible recipients, closing slow consumers along the way
2. **Pass 2**: Increment `seq` and serialize the frame only if at least one
   client will receive it
The log metadata now distinguishes `"targeted"` (targeted events) from `"dropped"`
(broadcast with no eligible recipients) for better observability.
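
The two-pass structure can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in `src/gateway/server-broadcast.ts`: the `Client` shape, `isSlow()`, `broadcastTwoPass`, and the frame layout are hypothetical, targeted events and logging are omitted, and the sketch assumes slow consumers are closed only when `dropIfSlow` is not set.

```typescript
// Hypothetical sketch of the two-pass approach; names and types are assumptions.
type Client = {
  id: string;
  isSlow: () => boolean;
  send: (data: string) => void;
  close: () => void;
};

type BroadcastOpts = { dropIfSlow?: boolean };

let seq = 0;

function broadcastTwoPass(
  clients: Iterable<Client>,
  event: string,
  payload: unknown,
  opts: BroadcastOpts = {},
): number | undefined {
  // Pass 1: select recipients without touching seq.
  const recipients: Client[] = [];
  for (const client of clients) {
    if (client.isSlow()) {
      // Assumption: droppable events skip slow clients, others close them.
      if (!opts.dropIfSlow) client.close();
      continue;
    }
    recipients.push(client);
  }

  // No eligible recipients: leave seq untouched.
  if (recipients.length === 0) return undefined;

  // Pass 2: allocate seq and serialize the frame once for all recipients.
  const frameSeq = ++seq;
  const frame = JSON.stringify({ type: "event", event, payload, seq: frameSeq });
  for (const client of recipients) client.send(frame);
  return frameSeq;
}

const delivered: string[] = [];
const fastClient: Client = {
  id: "fast",
  isSlow: () => false,
  send: (data) => { delivered.push(data); },
  close: () => {},
};
const slowClient: Client = {
  id: "slow",
  isSlow: () => true,
  send: () => {},
  close: () => {},
};

// All recipients are slow: nothing is sent and seq stays at 0.
broadcastTwoPass([slowClient], "tick", {}, { dropIfSlow: true });
// Mixed fast/slow: the fast client receives the first allocated seq.
broadcastTwoPass([fastClient, slowClient], "tick", {}, { dropIfSlow: true });
```

In this sketch the fast client's first frame carries the first allocated `seq`, because the earlier all-slow broadcast never allocated one.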
## Testing
- [ ] `pnpm build` passes
- [ ] `pnpm check` passes
- [ ] Added test: dropIfSlow does not increment seq when all clients are slow
- [ ] Added test: mixed fast/slow clients receive contiguous seq
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR changes the gateway WebSocket broadcaster to avoid consuming a broadcast sequence number when no clients will actually receive the frame (e.g., when all recipients are filtered out by `dropIfSlow`). It does this by splitting `broadcastInternal()` into two passes: first selecting eligible recipients (and filtering/handling slow consumers), then allocating `seq` and serializing the frame only if at least one recipient exists. Tests were added to assert that `seq` is not incremented when all clients are slow with `dropIfSlow`, and that fast clients still observe contiguous `seq` values when mixed with slow clients.
<h3>Confidence Score: 3/5</h3>
- This PR is close to safe to merge, but there is a behavior change around handling slow clients when dropIfSlow is enabled that should be confirmed/fixed before merging.
- The seq allocation change and added tests align with the stated bug. However, the new first-pass logic skips slow clients when `dropIfSlow` is true without closing them, whereas previously slow consumers were closed; if the prior behavior was relied on for backpressure enforcement, this is a functional regression. Tooling/tests were not runnable in this environment (pnpm unavailable), so confidence is reduced.
- src/gateway/server-broadcast.ts
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->