#19636: fix(agents): harden overflow recovery observability + subagent terminal fallback

by Jackten open 2026-02-18 01:37

agents size: L

## Summary

- Fix deterministic overflow fallback behavior when error payloads are missing or normalize to empty.
- Add structured `overflow.recovery` diagnostics to make branch/outcome triage queryable.
- Surface a terminal, idempotent requester callback when subagent completion announce retries give up.
- Improve follow-up phrase detection (`"I'll report back"` variants) so interim updates are not misclassified as final.

## Change Type

- [x] Bug fix
- [ ] Feature
- [x] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope

- [x] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [x] Memory / storage
- [ ] Integrations
- [x] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue

- Closes #19629

## Why this is still needed after v2026.2.17

Upstream includes multiple subagent announce improvements; this PR keeps only additive behavior not present on current `upstream/main` markers:

- requester-visible give-up callback fallback + idempotent persistence markers
- overflow deterministic fallback coverage + observability assertions across targeted paths
- follow-up phrase handling for `will report back` variants

## User-visible / Behavior Changes

- On terminal subagent announce give-up, requester now receives an explicit system fallback callback (once).
- Overflow recovery remains functionally stable but now has better deterministic fallback and diagnostics classification.
- No new CLI command/config surface.

## Security Impact

- New permissions/capabilities? No
- Secrets/tokens handling changed? No
- New/changed network calls? No
- Command/tool execution surface changed? No
- Data access scope changed? No

## Repro + Verification

### Environment

- OS: Ubuntu 25.10
- Runtime: Node 22 + pnpm workspace checkout
- Repo state: clean, branch `fix/overflow-recovery-observability-wave1`

### Verification commands (passing)

```bash
pnpm check
pnpm build

pnpm vitest run --config vitest.unit.config.ts \
  src/auto-reply/reply/agent-runner.runreplyagent.test.ts \
  src/auto-reply/reply/reply-state.test.ts \
  src/infra/infra-store.test.ts \
  src/agents/subagent-announce-queue.test.ts \
  src/agents/subagent-registry.announce-loop-guard.test.ts \
  src/agents/subagent-registry.steer-restart.test.ts \
  src/agents/subagent-registry.nested.test.ts \
  src/cron/isolated-agent/subagent-followup.test.ts

pnpm vitest run --config vitest.e2e.config.ts \
  src/agents/pi-embedded-runner/run.overflow-compaction.e2e.test.ts \
  src/agents/subagent-registry.persistence.e2e.test.ts

pnpm vitest run --config vitest.unit.config.ts \
  src/browser/server.auth-token-gates-http.test.ts \
  src/web/media.test.ts

pnpm vitest run --config vitest.gateway.config.ts \
  src/gateway/auth.test.ts \
  src/gateway/net.test.ts
```

### Controlled live smoke (post-restart)

- Session: `smoke-post-restart-1771388006`
- Run ID: `cdcd0950-ba6f-4f4b-a326-91d038b2433f`
- Output: `ok`
- Gateway remained healthy and RPC reachable.
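
### Illustration: follow-up phrase detection

For reviewers, the follow-up phrase handling mentioned in the Summary can be sketched roughly as below. This is a minimal illustration only: `looksLikeInterimUpdate` and `FOLLOW_UP_PATTERN` are hypothetical names, and the actual patterns in `src/cron/isolated-agent/subagent-followup.ts` may differ.

```typescript
// Hypothetical pattern (assumption): matches "I'll report back", "I will report back",
// "we'll report back", "we will report back", and a bare "will report back".
const FOLLOW_UP_PATTERN =
  /\b(?:(?:i|we)(?:['’]ll|\s+will)|will)\s+report\s+back\b/i;

// Returns true when the text reads like an interim update that promises a
// later follow-up, so a caller should not treat it as the final answer.
function looksLikeInterimUpdate(text: string): boolean {
  // Collapse runs of whitespace so line breaks inside the phrase still match.
  const normalized = text.replace(/\s+/g, " ").trim();
  return FOLLOW_UP_PATTERN.test(normalized);
}

// Usage: interim messages are flagged, final ones are not.
const interim = looksLikeInterimUpdate("I'll report back once the build finishes.");
const final = looksLikeInterimUpdate("Build finished; all tests pass.");
console.log({ interim, final });
```

A single case-insensitive pattern keeps the variants in one place; a real implementation would likely need a broader phrase list than this sketch covers.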
## Evidence

- [x] Targeted tests before/after behavior assertions
- [x] Runtime/log smoke evidence
- [ ] Screenshot/recording
- [ ] Perf benchmark

## Human Verification

Personally verified:

- deterministic overflow fallback assertions pass
- overflow diagnostics events emitted/queried in infra-store tests
- retry-budget give-up callback persistence/idempotency behavior in subagent loop-guard tests
- post-restart live agent run succeeded end-to-end

Not fully verified locally:

- full `pnpm test` suite (intentionally skipped under server-safe policy; upstream CI is expected to serve as the full-suite gate)

## Compatibility / Migration

- Backward compatible: Yes
- Config/env changes required: No
- Migration needed: No

## Failure Recovery

- Revert this PR's commits in order if needed.
- Primary touched modules:
  - `src/auto-reply/reply/agent-runner.ts`
  - `src/infra/diagnostic-events.ts`
  - `src/agents/subagent-registry.ts`
  - `src/cron/isolated-agent/subagent-followup.ts`

## Risks and Mitigations

- Risk: extra diagnostics volume
  - Mitigation: structured events and targeted assertions; no new external call surface.
- Risk: callback duplication noise
  - Mitigation: persisted `announceGiveUpNotifiedAt` + idempotency key.

## AI-assisted disclosure

- [x] AI-assisted contribution
- Testing level: fully tested on targeted + sentinel suites, build pass, controlled live smoke pass
- Confirmed understanding of changed code paths and behavior
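
## Illustration: idempotent give-up callback

The callback-duplication mitigation under Risks and Mitigations can be pictured with the sketch below. Only the `announceGiveUpNotifiedAt` field name comes from this PR; the record shape, `notifyGiveUpOnce`, the callback signatures, and the key format are assumptions made for illustration, and the code in `src/agents/subagent-registry.ts` may be organized differently.

```typescript
// Hypothetical record shape (assumption); only `announceGiveUpNotifiedAt` is named in this PR.
interface SubagentRunRecord {
  runId: string;
  announceGiveUpNotifiedAt?: number; // epoch ms, set once the requester was told
}

type NotifyRequester = (message: string, idempotencyKey: string) => void;
type PersistRecord = (record: SubagentRunRecord) => void;

// Sends the terminal fallback callback at most once per run.
// Returns true if a notification was sent, false if it was already recorded.
function notifyGiveUpOnce(
  record: SubagentRunRecord,
  notify: NotifyRequester,
  persist: PersistRecord,
  now: () => number = Date.now,
): boolean {
  if (record.announceGiveUpNotifiedAt !== undefined) {
    return false; // marker already persisted: skip the duplicate callback
  }
  // Persist the marker before sending (a choice made for this sketch), so a
  // restart between the two steps cannot produce a second callback.
  record.announceGiveUpNotifiedAt = now();
  persist(record);
  // Hypothetical key format; lets a downstream consumer deduplicate as well.
  notify(
    `Subagent run ${record.runId} completed, but announcing the result failed after retries.`,
    `announce-give-up:${record.runId}`,
  );
  return true;
}

// Usage: the second call is a no-op because the marker is already set.
const sentKeys: string[] = [];
const record: SubagentRunRecord = { runId: "run-1" };
const recordKey: NotifyRequester = (_message, key) => {
  sentKeys.push(key);
};
const firstCall = notifyGiveUpOnce(record, recordKey, () => {}, () => 1000);
const secondCall = notifyGiveUpOnce(record, recordKey, () => {}, () => 2000);
console.log({ firstCall, secondCall, sentKeys });
```

Persisting before notifying trades a possibly missed callback for a guarantee of no duplicates; the reverse order would rely on the idempotency key for deduplication instead.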