#16196: fix(gateway): add periodic cleanup to prevent memory leak in ToolEventRecipientRegistry
gateway
stale
size: XS
Cluster: Memory Leak Fixes and Cleanup
## Summary
Fix a potential memory leak in `ToolEventRecipientRegistry` by adding periodic cleanup of stale entries.
## Problem
The registry tracks WebSocket connections interested in tool events for specific agent runs. Each entry has a TTL of 10 minutes (or 30 seconds after finalization).
**Current behavior**: Stale entries are only cleaned up during `add()`, `get()`, or `markFinal()` operations.
**Issue**: If no new tool events arrive after agent runs complete, expired entries remain in memory indefinitely. In long-running gateway servers with intermittent activity, this could lead to unbounded memory growth.
## Solution
Add a periodic cleanup timer that runs every 60 seconds:
```typescript
const pruneTimer = setInterval(prune, TOOL_EVENT_PRUNE_INTERVAL_MS);
// Allow process exit without waiting for timer
if (pruneTimer.unref) {
  pruneTimer.unref();
}
```
Also adds a `dispose()` method to the registry interface for proper cleanup during server shutdown.
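For context, here is a minimal sketch of how the timer, the existing lazy pruning, and `dispose()` might fit together. Only `TOOL_EVENT_PRUNE_INTERVAL_MS` and the four method names come from this PR. The other constant names, the `Entry` shape, the injectable `now` clock, and the `pruneNow`/`size` helpers are assumptions made for illustration, and the real `createToolEventRecipientRegistry()` may differ.

```typescript
// Values are taken from the PR text; names other than TOOL_EVENT_PRUNE_INTERVAL_MS are hypothetical.
const TOOL_EVENT_TTL_MS = 10 * 60 * 1000; // 10-minute TTL
const TOOL_EVENT_FINAL_GRACE_MS = 30 * 1000; // 30 seconds after finalization
const TOOL_EVENT_PRUNE_INTERVAL_MS = 60 * 1000; // periodic sweep

// Assumed entry shape: the connections for one run plus an absolute expiry time.
type Entry = { connIds: Set<string>; expiresAt: number; finalized: boolean };

// `now` is injectable (an assumption) so expiry can be exercised without waiting.
function createRegistrySketch(now: () => number = Date.now) {
  const recipients = new Map<string, Entry>();

  // Drop every entry whose expiry time has passed.
  const prune = (): void => {
    const t = now();
    recipients.forEach((entry, runId) => {
      if (entry.expiresAt <= t) recipients.delete(runId);
    });
  };

  // Periodic sweep so idle registries still shed expired entries.
  const pruneTimer = setInterval(prune, TOOL_EVENT_PRUNE_INTERVAL_MS);
  // unref() exists on Node timers only; guard so the sketch also type-checks elsewhere.
  (pruneTimer as unknown as { unref?: () => void }).unref?.();

  return {
    add(runId: string, connId: string): void {
      prune();
      const entry = recipients.get(runId);
      if (entry) {
        entry.connIds.add(connId);
        // Assumption: adding a recipient refreshes the TTL unless the run is finalized.
        if (!entry.finalized) entry.expiresAt = now() + TOOL_EVENT_TTL_MS;
      } else {
        recipients.set(runId, {
          connIds: new Set([connId]),
          expiresAt: now() + TOOL_EVENT_TTL_MS,
          finalized: false,
        });
      }
    },
    get(runId: string): ReadonlySet<string> | undefined {
      prune();
      return recipients.get(runId)?.connIds;
    },
    markFinal(runId: string): void {
      prune();
      const entry = recipients.get(runId);
      if (entry) {
        entry.finalized = true;
        entry.expiresAt = now() + TOOL_EVENT_FINAL_GRACE_MS;
      }
    },
    dispose(): void {
      clearInterval(pruneTimer);
      recipients.clear();
    },
    // Hypothetical helpers for inspection; not part of the registry type in this PR.
    pruneNow: prune,
    size: (): number => recipients.size,
  };
}

// Usage with a fake clock, so expiry can be observed without waiting.
let fakeNow = 0;
const registry = createRegistrySketch(() => fakeNow);
registry.add("run-1", "conn-a");
registry.add("run-2", "conn-b");
registry.markFinal("run-2");

fakeNow = 31 * 1000; // past the 30-second grace period
registry.pruneNow(); // what the interval callback would do while idle
const afterGrace = registry.size();

fakeNow = TOOL_EVENT_TTL_MS + 1; // past the 10-minute TTL
registry.pruneNow();
const afterTtl = registry.size();

registry.dispose();
```

In this sketch the interval callback is the same `prune` function that `add()`, `get()`, and `markFinal()` already call, which mirrors the PR's approach of reusing the existing pruning logic rather than introducing new expiry rules.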
## Why 60 seconds?
- Much shorter than the 10-minute TTL, ensuring timely cleanup
- Long enough to avoid excessive CPU overhead
- Aligns with typical observability/metrics intervals
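To make the trade-off concrete: with a periodic sweep, an expired entry should linger at most roughly one interval past its expiry, assuming no other operation prunes it sooner and ignoring timer drift. The constant names and the helper below are illustrative and do not come from the PR.

```typescript
// Values from the PR text; names are hypothetical.
const TTL_MS = 10 * 60 * 1000; // 10-minute TTL
const FINAL_GRACE_MS = 30 * 1000; // 30 seconds after finalization
const PRUNE_INTERVAL_MS = 60 * 1000; // sweep period

// Upper bound on how long an entry can stay in memory while the registry is idle:
// it expires right after one sweep and is removed by the next one.
function worstCaseRetentionMs(ttlMs: number, intervalMs: number): number {
  return ttlMs + intervalMs;
}

const activeWorstCaseMs = worstCaseRetentionMs(TTL_MS, PRUNE_INTERVAL_MS); // 660000 ms (11 minutes)
const finalWorstCaseMs = worstCaseRetentionMs(FINAL_GRACE_MS, PRUNE_INTERVAL_MS); // 90000 ms (90 seconds)
```

Under these assumptions a 60-second interval adds at most 10% to the lifetime of a non-finalized entry, while finalized entries may live up to three times their 30-second grace period.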
## API Change
The `ToolEventRecipientRegistry` type now includes a `dispose()` method:
```typescript
export type ToolEventRecipientRegistry = {
  add: (runId: string, connId: string) => void;
  get: (runId: string) => ReadonlySet<string> | undefined;
  markFinal: (runId: string) => void;
  dispose: () => void; // NEW
};
```
Callers should call `dispose()` during graceful shutdown to clear the timer and any remaining entries.
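One way a shutdown path could wire this in is sketched below. `closeGatewaySketch` and the `Disposable` type are hypothetical names used for illustration. The gateway's actual close handler is not shown in this PR and may be structured differently.

```typescript
// Anything that exposes dispose(), such as the registry type above.
type Disposable = { dispose: () => void };

// Hypothetical shutdown helper: dispose every resource and collect failures
// instead of letting one failing dispose() skip the rest.
function closeGatewaySketch(resources: Disposable[]): Error[] {
  const errors: Error[] = [];
  for (const resource of resources) {
    try {
      resource.dispose();
    } catch (err) {
      errors.push(err instanceof Error ? err : new Error(String(err)));
    }
  }
  return errors;
}

// Stand-in for a real registry instance; it only counts dispose() calls.
let disposeCalls = 0;
const toolEventRecipients: Disposable = {
  dispose: () => {
    disposeCalls += 1;
  },
};

const shutdownErrors = closeGatewaySketch([toolEventRecipients]);
```

Because the timer is unref'd, skipping this call does not block process exit, but calling `dispose()` explicitly releases the interval and the map deterministically, which matters in tests and in processes that create more than one registry.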
## Test Plan
- [x] Verify registry entries are cleaned up after TTL expires (during idle)
- [x] Verify `dispose()` clears timer and entries
- [x] Verify `unref()` allows clean process exit
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds a periodic `setInterval` cleanup timer (every 60s) to `ToolEventRecipientRegistry` to prune stale entries during idle periods, preventing unbounded memory growth in long-running gateway servers. Also adds a `dispose()` method to the registry type for proper resource cleanup.
- Adds `TOOL_EVENT_PRUNE_INTERVAL_MS` (60s) constant and a `setInterval(prune, ...)` call inside `createToolEventRecipientRegistry()`
- Uses `pruneTimer.unref()` to allow clean process exit even if the timer is active
- Adds `dispose()` to the `ToolEventRecipientRegistry` type, which clears the interval and the recipients map
- The existing `prune()` logic (TTL-based expiry) is reused by the periodic timer — no new pruning semantics are introduced
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — the change is additive, well-scoped, and the timer is properly unref'd to avoid blocking process exit.
- The implementation is correct: it reuses the existing `prune()` function, properly guards `unref()`, and adds a clean `dispose()` method. Score is 4 rather than 5 because `dispose()` is not yet called during gateway shutdown (the timer is unref'd so it won't block exit, but cleanup is incomplete). This was noted in a previous review thread.
- No files require special attention beyond the previously noted integration gap in `server-close.ts`.
<sub>Last reviewed commit: c3f12b7</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->