← Back to PRs

#22719: fix(agents): make subagent announce timeout configurable (restore 60s default)

by Valadon open 2026-02-21 15:16 View on GitHub →
agents size: S
## Summary Describe the problem and fix in 2–5 bullets: - Problem: commit `b8f66c260` (from #14447) reduced subagent announce gateway timeouts from `60_000` to hardcoded `15_000`, which can cause false cron announce delivery failures on slower-but-valid runs. - Why it matters: isolated cron jobs can complete useful work but still be marked failed when announce delivery exceeds 15s. - What changed: added optional config `agents.defaults.subagents.announceTimeoutMs` and replaced the three hardcoded `15_000` announce call timeouts in `src/agents/subagent-announce.ts` with a config-resolved timeout. - Default behavior restored: when unset/invalid, announce timeout is now `60_000` (pre-regression behavior). - What did NOT change (scope boundary): no retry/backoff, no dedupe/stale-gate logic, no cron orchestration redesign, no CLI additions. ## Change Type (select all) - [x] Bug fix - [ ] Feature - [ ] Refactor - [ ] Docs - [ ] Security hardening - [ ] Chore/infra ## Scope (select all touched areas) - [x] Gateway / orchestration - [ ] Skills / tool execution - [ ] Auth / tokens - [ ] Memory / storage - [ ] Integrations - [x] API / contracts - [ ] UI / DX - [ ] CI/CD / infra ## Linked Issue/PR - Closes #22027 - Related #22236 - Related #21041 - Related #22000 - Related #17756 - Related #15259 ### Related open PRs (distinct but complementary scope) - #17001 — broader retry/backoff + timeout behavior changes; this PR is intentionally narrower (timeout configurability + default restoration only). - #17028 — retry-focused approach; this PR does not add retries and can coexist. - #20328 — large reliability bundle (persistence/recovery/CLI); this PR isolates the regression fix to keep review surface minimal. - #22411 — draft race/dedupe mitigation; complementary to this config/default fix and non-conflicting in intent. ## User-visible / Behavior Changes - New optional config key: `agents.defaults.subagents.announceTimeoutMs`. - Subagent announce timeout default is `60_000` when key is unset. - If configured, announce timeout uses the configured positive integer (with safe bounds/clamping). ## Security Impact (required) - New permissions/capabilities? (`No`) - Secrets/tokens handling changed? (`No`) - New/changed network calls? (`No`) - Command/tool execution surface changed? (`No`) - Data access scope changed? (`No`) - If any `Yes`, explain risk + mitigation: `N/A` ## Repro + Verification ### Environment - OS: macOS (Darwin arm64) - Runtime/container: Node + pnpm workspace - Model/provider: N/A (config/announce timeout path) - Integration/channel (if any): cron isolated-agent announce flow - Relevant config (redacted): - default path (unset key) - override path: `agents.defaults.subagents.announceTimeoutMs: 90000` ### Steps 1. Run an isolated cron flow that routes through `runSubagentAnnounceFlow(... waitForCompletion:false ...)` and announce delivery. 2. Observe announce gateway call timeout behavior with key unset. 3. Set `agents.defaults.subagents.announceTimeoutMs` and repeat. ### Expected - Unset key uses `60_000` timeout. - Configured key is honored for both direct `agent` announce and completion `send` announce paths. ### Actual - Verified by tests: default and configured timeout are both honored. ## Evidence Attach at least one: - [x] Failing test/log before + passing after - [x] Trace/log snippets - [ ] Screenshot/recording - [ ] Perf numbers (if relevant) Evidence included in branch: - New test file: `src/agents/subagent-announce.timeout.test.ts` - `uses 60s timeout by default for direct announce agent call` - `honors configured announce timeout for direct announce agent call` - `honors configured announce timeout for completion direct send call` - Full gate run command used: - `pnpm install && pnpm build && pnpm check && pnpm test` ## Human Verification (required) What you personally verified (not just CI), and how: - Verified scenarios: - all three prior hardcoded `15_000` announce callsites now use config-resolved timeout - default fallback is `60_000` - configured override is honored for both `agent` and `send` delivery paths - Edge cases checked: - unset config - non-finite/invalid values fallback to default - bounds clamped to timer-safe range - What you did **not** verify: - retry semantics or dedupe behaviors (explicitly out of scope) ## Compatibility / Migration - Backward compatible? (`Yes`) - Config/env changes? (`Yes`, optional new config key) - Migration needed? (`No`) - If yes, exact upgrade steps: `N/A` ## Failure Recovery (if this breaks) - How to disable/revert this change quickly: - revert this PR commit(s), or remove the optional key to use default behavior. - Files/config to restore: - `src/agents/subagent-announce.ts` - `src/config/types.agent-defaults.ts` - `src/config/zod-schema.agent-defaults.ts` - Known bad symptoms reviewers should watch for: - unexpected announce timeout regressions or invalid timeout parsing. ## Risks and Mitigations - Risk: longer wait before timeout compared to 15s when upstream call is truly hung. - Mitigation: timeout remains bounded/configurable; default restores previously shipped 60s behavior. ## AI-Assisted Disclosure AI-assisted: Yes (research + drafting + implementation support). Human-verified: code paths, scope boundaries, and local test/build/check execution. <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR addresses a regression from commit `b8f66c260` (#14447) that reduced subagent announce gateway timeouts from 60s to a hardcoded 15s, causing false failures for slower-but-valid cron jobs. The fix adds a new optional config field `agents.defaults.subagents.announceTimeoutMs` with a 60s default, replacing three hardcoded 15s timeout values in `subagent-announce.ts`. The implementation includes proper validation (finite number check, clamping to timer-safe bounds), comprehensive test coverage, and maintains full backward compatibility. - Restored 60s default announce timeout (pre-regression behavior) - Made timeout configurable via `agents.defaults.subagents.announceTimeoutMs` - Replaced all three hardcoded `15_000` timeout callsites - Added comprehensive unit tests covering default and configured timeout scenarios - Proper bounds checking: clamps to `[1, 2_147_000_000]` to avoid timer issues <h3>Confidence Score: 5/5</h3> - This PR is safe to merge with minimal risk - The change is a straightforward bug fix with excellent implementation quality: adds a single config field with proper validation, replaces three hardcoded timeout values with config-resolved values, includes comprehensive test coverage, maintains full backward compatibility (defaults to 60s when unset), and uses safe bounds checking to prevent timer overflow issues. The scope is intentionally narrow and well-documented. - No files require special attention <sub>Last reviewed commit: 8ec3c5b</sub> <!-- greptile_other_comments_section --> <sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub> <!-- /greptile_comment -->

Most Similar PRs