#22926: feat(gateway): add Windows-native watch DX and tool/channel observability
Labels: docs, gateway, scripts, size: XL
Cluster: Windows Gateway Enhancements
## Summary
- Problem: the native Windows dev flow was under-documented, and gateway watch mode could clash with scheduled daemon state. Gateway also lacked first-class visibility into per-tool/per-channel usage and latency for bottleneck analysis.
- Why it matters: contributors on Windows hit avoidable startup friction, and operators could not quickly detect high-latency tools/channels from Gateway runtime signals.
- What changed:
  - Added `scripts/gateway-watch.mjs` and switched `pnpm gateway:watch` to it.
  - On native Windows, watch mode now runs a best-effort `openclaw gateway stop` then starts a compatible command (`gateway run --bind loopback --port 18789 --allow-unconfigured`).
  - Added PowerShell-focused smoke/unit tests for watch behavior.
  - Added Gateway tool observability aggregation (`src/gateway/tool-observability.ts`) tracking calls/errors/latency by tool and channel.
  - Wired metrics collection into `/tools/invoke` and exposed snapshot via `usage.gatewayToolMetrics`.
  - Added/updated tests for observability and usage handler integration.
  - Added architecture Mermaid map + first PR runbook docs and linked docs directory/README/debugging/setup updates.
- What did NOT change: no changes to transport auth model, no changes to channel runtime behavior, no persistent telemetry storage introduced (in-memory snapshot only).
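
The watch-mode branching described above can be sketched roughly as follows. This is an illustrative reconstruction, not the contents of `scripts/gateway-watch.mjs`: the names `WatchPlan`, `buildWatchPlan`, and `runBestEffort` are hypothetical, and only the command strings come from this PR description.

```typescript
// Hypothetical sketch: type and function names are illustrative assumptions.
type WatchPlan = {
  // Commands attempted before starting; failures are ignored (best-effort).
  preSteps: string[][];
  // Arguments handed to the watch runner.
  runArgs: string[];
};

function buildWatchPlan(platform: string): WatchPlan {
  if (platform === "win32") {
    return {
      // Stop any scheduled daemon first so it does not hold the port.
      preSteps: [["openclaw", "gateway", "stop"]],
      runArgs: [
        "gateway", "run",
        "--bind", "loopback",
        "--port", "18789",
        "--allow-unconfigured",
      ],
    };
  }
  // Non-Windows platforms keep the previous `gateway --force` invocation.
  return { preSteps: [], runArgs: ["gateway", "--force"] };
}

// Runs a step and reports success without ever throwing.
function runBestEffort(step: () => void): boolean {
  try {
    step();
    return true;
  } catch {
    return false;
  }
}
```

Keeping plan construction as a pure function of the platform string is one way to make the Windows branch testable from any host, which fits the stated limitation that no live Windows run was performed.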
## Change Type (select all)
- [ ] Bug fix
- [x] Feature
- [ ] Refactor
- [x] Docs
- [ ] Security hardening
- [ ] Chore/infra
## Scope (select all touched areas)
- [x] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [x] API / contracts
- [x] UI / DX
- [x] CI/CD / infra
## Linked Issue/PR
- Related #22750
- Related #22873
## User-visible / Behavior Changes
- `pnpm gateway:watch` now uses a Windows-safe native flow when `process.platform === "win32"`.
- New Gateway RPC method: `usage.gatewayToolMetrics` for usage/latency by tool and channel.
- Docs include an expanded architecture Mermaid and a contributor first-PR runbook.
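
To make the new RPC method more concrete, here is a minimal sketch of in-memory aggregation by tool and channel. It assumes a shape for the snapshot; the actual fields returned by `usage.gatewayToolMetrics` are not specified in this description, so `ToolMetrics`, `Stats`, `Summary`, and the field names are hypothetical, as are the tool and channel names in the usage note.

```typescript
// Hypothetical sketch: the real module is src/gateway/tool-observability.ts,
// whose API and snapshot shape are not shown in this PR description.
type Stats = { calls: number; errors: number; totalLatencyMs: number; maxLatencyMs: number };
type Summary = { calls: number; errors: number; avgLatencyMs: number; maxLatencyMs: number };

// Adds one invocation to the counters stored under `key`.
function bump(table: Record<string, Stats>, key: string, latencyMs: number, ok: boolean): void {
  const s: Stats = table[key] || { calls: 0, errors: 0, totalLatencyMs: 0, maxLatencyMs: 0 };
  s.calls += 1;
  if (!ok) s.errors += 1;
  s.totalLatencyMs += latencyMs;
  s.maxLatencyMs = Math.max(s.maxLatencyMs, latencyMs);
  table[key] = s;
}

// Converts raw counters into a read-only summary with an average latency.
function summarize(table: Record<string, Stats>): Record<string, Summary> {
  const out: Record<string, Summary> = Object.create(null);
  Object.keys(table).forEach((key) => {
    const s = table[key];
    out[key] = {
      calls: s.calls,
      errors: s.errors,
      avgLatencyMs: s.totalLatencyMs / s.calls,
      maxLatencyMs: s.maxLatencyMs,
    };
  });
  return out;
}

class ToolMetrics {
  // Null-prototype tables so arbitrary tool names cannot collide with Object.prototype.
  private byTool: Record<string, Stats> = Object.create(null);
  private byChannel: Record<string, Stats> = Object.create(null);

  record(tool: string, channel: string, latencyMs: number, ok: boolean): void {
    bump(this.byTool, tool, latencyMs, ok);
    bump(this.byChannel, channel, latencyMs, ok);
  }

  // In-memory only: a process restart starts again from empty tables.
  snapshot(): { byTool: Record<string, Summary>; byChannel: Record<string, Summary> } {
    return { byTool: summarize(this.byTool), byChannel: summarize(this.byChannel) };
  }
}
```

For example, recording `("search", "telegram", 100, true)` and `("search", "unknown", 300, false)` would yield a `search` entry with 2 calls, 1 error, and an average latency of 200 ms.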
## Security Impact (required)
- New permissions/capabilities? **No**
- Secrets/tokens handling changed? **No**
- New/changed network calls? **No**
- Command/tool execution surface changed? **Yes** (watch wrapper behavior only)
- Data access scope changed? **No**
- If any `Yes`, explain risk + mitigation: watch wrapper uses existing local CLI command (`gateway stop`) best-effort, then existing gateway run invocation; no new remote surface.
## Repro + Verification
### Steps
1. Run `pnpm gateway:watch` on Windows native PowerShell.
2. Send requests to `/tools/invoke` with a channel header and mixed tool outcomes.
3. Call `usage.gatewayToolMetrics` via Gateway WS RPC.
### Expected
- Watch mode starts reliably without daemon/port conflict loops on native Windows.
- Metrics snapshot returns aggregated per-tool/per-channel calls/errors/latency.
### Actual
- Verified with targeted unit/smoke tests.
### Evidence
- `pnpm vitest run src/infra/gateway-watch.test.ts src/infra/gateway-watch.powershell-smoke.test.ts src/gateway/tool-observability.test.ts src/gateway/server-methods/usage.sessions-usage.test.ts src/gateway/tools-invoke-http.test.ts src/gateway/method-scopes.test.ts`
## Human Verification (required)
- Verified scenarios:
  - Windows watch command construction + daemon stop pre-step behavior.
  - Tool metrics aggregation and Gateway usage method response shape.
  - Method scopes include new usage method.
- Edge cases checked:
  - Generic/agent session keys fall back to channel `unknown`.
  - Latency aggregation across mixed success/error tool invocations.
- What you did not verify:
  - Full end-to-end manual run on a live Windows host in this environment.
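
The `unknown` channel fallback checked above could work along these lines. This is only a sketch under assumptions: the session key format (a `channel:` prefix), the `KNOWN_CHANNELS` list, and the `resolveChannel` name are all hypothetical and are not taken from the PR's code.

```typescript
// Hypothetical sketch: key format and channel names are assumptions for illustration.
const KNOWN_CHANNELS = ["telegram", "discord", "slack"];

// Returns the channel encoded in a session key, or "unknown" when the key is
// missing, generic, or agent-scoped rather than channel-scoped.
function resolveChannel(sessionKey: string | undefined): string {
  if (!sessionKey) return "unknown";
  const prefix = sessionKey.split(":")[0].trim().toLowerCase();
  return KNOWN_CHANNELS.indexOf(prefix) >= 0 ? prefix : "unknown";
}
```

Bucketing unrecognized keys under a single `unknown` label keeps the per-channel table bounded instead of growing one entry per unparseable key.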
## Compatibility / Migration
- Backward compatible? **Yes**
- Config/env changes? **No**
- Migration needed? **No**
## Failure Recovery (if this breaks)
- Revert command behavior quickly:
  - restore `package.json` `gateway:watch` to `node scripts/watch-node.mjs gateway --force`
- Files/config to restore:
  - `package.json`, `scripts/gateway-watch.mjs`, `src/gateway/tool-observability.ts`, `src/gateway/tools-invoke-http.ts`, `src/gateway/server-methods/usage.ts`
- Known bad symptoms reviewers should watch for:
  - missing `usage.gatewayToolMetrics` method classification
  - watch-mode regressions on non-Windows platforms
## Risks and Mitigations
- Risk: in-memory metrics can reset on restart and may be misunderstood as historical analytics.
- Mitigation: docs explicitly frame it as runtime observability snapshot, not persisted telemetry.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds Windows-native watch mode support and Gateway tool observability to improve developer experience and operational visibility.
**Key changes:**
- Added a `scripts/gateway-watch.mjs` wrapper that detects the Windows platform and uses a compatible command flow (`gateway run` instead of the `--force` flag, which relies on Unix tools like `lsof`)
- Windows watch mode now runs best-effort `gateway stop` before starting to avoid daemon port conflicts
- Added in-memory tool observability tracking (`src/gateway/tool-observability.ts`) that aggregates calls, errors, and latency metrics by tool and channel
- Wired metrics collection into `/tools/invoke` HTTP handler and exposed via new Gateway RPC method `usage.gatewayToolMetrics`
- Added comprehensive test coverage for watch behavior and observability aggregation
- Added architecture Mermaid diagram and first-PR runbook documentation
- Updated docs (README, setup, debugging) to document Windows-native flow
**Implementation quality:**
- Clean separation of concerns: watch wrapper delegates to existing `runWatchMain`, observability module is standalone
- Good defensive coding: Windows daemon stop is wrapped in try-catch (best-effort), channel resolution has sensible fallback to `unknown`
- Type-safe with proper TypeScript definitions in `src/infra/scripts-modules.d.ts`
- Test coverage includes unit tests, smoke tests, and integration tests for the usage handler
- In-memory metrics are explicitly documented as runtime snapshots, not persistent telemetry
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The implementation is well-tested with comprehensive unit and integration tests, follows existing code patterns, has good error handling (best-effort daemon stop, channel fallbacks), maintains backward compatibility (additive changes only), and is properly documented. The changes are isolated to watch mode wrapper and observability tracking without touching core gateway or tool execution logic.
- No files require special attention
<sub>Last reviewed commit: 781875f</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->