#19636: fix(agents): harden overflow recovery observability + subagent terminal fallback
## Summary
- Fix deterministic overflow fallback behavior when error payloads are missing or normalize to empty.
- Add structured `overflow.recovery` diagnostics to make branch/outcome triage queryable.
- Surface a terminal, idempotent requester callback when retries of the subagent completion announce are exhausted.
- Improve follow-up phrase detection (`"I'll report back"` variants) so interim updates are not misclassified as final.
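The first two items can be illustrated with a minimal sketch. It assumes a hypothetical `resolveOverflowReply` helper, an assumed event shape with `branch` and `outcome` fields, and placeholder fallback text; none of these names are taken from the PR's code, only the `overflow.recovery` event name is.

```typescript
// Hypothetical types: the PR's real event schema is not shown in this description.
type OverflowBranch = "payload" | "fallback_missing" | "fallback_empty";

interface OverflowRecoveryEvent {
  type: "overflow.recovery";
  branch: OverflowBranch;
  outcome: "original_reply" | "fallback_reply";
}

// Placeholder text, not the project's actual fallback message.
const FALLBACK_TEXT = "The request exceeded the context limit and could not be recovered.";

function resolveOverflowReply(
  payloadText: string | null | undefined,
  emit: (event: OverflowRecoveryEvent) => void,
): string {
  // Normalize first, so whitespace-only payloads count as empty.
  const normalized = (payloadText ?? "").trim();
  const branch: OverflowBranch =
    payloadText == null
      ? "fallback_missing"
      : normalized === ""
        ? "fallback_empty"
        : "payload";
  const usedFallback = branch !== "payload";

  // One structured event per decision keeps branch/outcome queryable.
  emit({
    type: "overflow.recovery",
    branch,
    outcome: usedFallback ? "fallback_reply" : "original_reply",
  });

  return usedFallback ? FALLBACK_TEXT : normalized;
}
```

Because the same input always selects the same branch, tests can assert on both the returned text and the emitted event.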
## Change Type
- [x] Bug fix
- [ ] Feature
- [x] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra
## Scope
- [x] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [x] Memory / storage
- [ ] Integrations
- [x] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra
## Linked Issue
- Closes #19629
## Why this is still needed after v2026.2.17
Upstream already includes several subagent announce improvements. This PR keeps only the additive behavior that is absent from the current `upstream/main` markers:
- requester-visible give-up callback fallback + idempotent persistence markers
- overflow deterministic fallback coverage + observability assertions across targeted paths
- follow-up phrase handling for `will report back` variants
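As a rough illustration of the last item, the sketch below flags "will report back" style phrases so that a message containing them is treated as interim rather than final. The pattern and the `looksLikeInterimUpdate` name are assumptions made for this example; the PR's actual phrase list may differ.

```typescript
// Hypothetical pattern: matches "I'll report back", "we'll report back",
// and any "... will report back" phrasing, with straight or curly apostrophes.
const FOLLOW_UP_PATTERN = /\b(?:(?:i|we)['’]ll|will)\s+report\s+back\b/i;

// Returns true when the text promises a later update, i.e. it should not
// be classified as the final reply.
function looksLikeInterimUpdate(text: string): boolean {
  return FOLLOW_UP_PATTERN.test(text);
}
```

The regex is deliberately non-global, so repeated `test` calls carry no `lastIndex` state between messages.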
## User-visible / Behavior Changes
- On terminal subagent announce give-up, requester now receives an explicit system fallback callback (once).
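- Overflow recovery behaves as before on the normal path, but now falls back deterministically when the error payload is missing or normalizes to empty, and its `overflow.recovery` diagnostics classify the branch and outcome.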
- No new CLI command/config surface.
## Security Impact
- New permissions/capabilities? No
- Secrets/tokens handling changed? No
- New/changed network calls? No
- Command/tool execution surface changed? No
- Data access scope changed? No
## Repro + Verification
### Environment
- OS: Ubuntu 25.10
- Runtime: Node 22 + pnpm workspace checkout
- Repo state: clean, branch `fix/overflow-recovery-observability-wave1`
### Verification commands (passing)
```bash
pnpm check
pnpm build
pnpm vitest run --config vitest.unit.config.ts \
  src/auto-reply/reply/agent-runner.runreplyagent.test.ts \
  src/auto-reply/reply/reply-state.test.ts \
  src/infra/infra-store.test.ts \
  src/agents/subagent-announce-queue.test.ts \
  src/agents/subagent-registry.announce-loop-guard.test.ts \
  src/agents/subagent-registry.steer-restart.test.ts \
  src/agents/subagent-registry.nested.test.ts \
  src/cron/isolated-agent/subagent-followup.test.ts
pnpm vitest run --config vitest.e2e.config.ts \
  src/agents/pi-embedded-runner/run.overflow-compaction.e2e.test.ts \
  src/agents/subagent-registry.persistence.e2e.test.ts
pnpm vitest run --config vitest.unit.config.ts \
  src/browser/server.auth-token-gates-http.test.ts \
  src/web/media.test.ts
pnpm vitest run --config vitest.gateway.config.ts \
  src/gateway/auth.test.ts \
  src/gateway/net.test.ts
```
### Controlled live smoke (post-restart)
- Session: `smoke-post-restart-1771388006`
- Run ID: `cdcd0950-ba6f-4f4b-a326-91d038b2433f`
- Output: `ok`
- Gateway remained healthy and RPC reachable.
## Evidence
- [x] Targeted tests before/after behavior assertions
- [x] Runtime/log smoke evidence
- [ ] Screenshot/recording
- [ ] Perf benchmark
## Human Verification
Personally verified:
- deterministic overflow fallback assertions pass
- overflow diagnostics events emitted/queried in infra-store tests
- retry-budget give-up callback persistence/idempotency behavior in subagent loop-guard tests
- post-restart live agent run succeeded end-to-end
Not fully verified locally:
- full `pnpm test` suite (intentionally skipped under the server-safe policy; upstream CI is expected to act as the full-suite gate)
## Compatibility / Migration
- Backward compatible: Yes
- Config/env changes required: No
- Migration needed: No
## Failure Recovery
- Revert this PR's commits in order if needed.
- Primary touched modules:
  - `src/auto-reply/reply/agent-runner.ts`
  - `src/infra/diagnostic-events.ts`
  - `src/agents/subagent-registry.ts`
  - `src/cron/isolated-agent/subagent-followup.ts`
## Risks and Mitigations
- Risk: extra diagnostics volume
  - Mitigation: structured events and targeted assertions; no new external call surface.
- Risk: callback duplication noise
  - Mitigation: persisted `announceGiveUpNotifiedAt` + idempotency key.
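
The duplication mitigation can be sketched roughly as follows. Only the `announceGiveUpNotifiedAt` field name comes from this PR; the record shape, the `notifyAnnounceGiveUp` helper, the key format, and the message text are assumptions made for illustration, and real persistence is replaced by an in-memory object.

```typescript
// Hypothetical record shape; only announceGiveUpNotifiedAt is named in the PR.
interface SubagentRunRecord {
  runId: string;
  announceGiveUpNotifiedAt?: number;
}

interface GiveUpMessage {
  idempotencyKey: string;
  text: string;
}

// Sends the terminal fallback callback at most once per run.
// Returns true when a callback was sent, false when it was already recorded.
function notifyAnnounceGiveUp(
  record: SubagentRunRecord,
  send: (message: GiveUpMessage) => void,
  now: () => number = Date.now,
): boolean {
  if (record.announceGiveUpNotifiedAt !== undefined) {
    return false; // marker already set: skip the duplicate
  }
  // Set the marker before sending so a crash after the send cannot
  // produce a second notification on restart.
  record.announceGiveUpNotifiedAt = now();
  send({
    // Assumed key format; lets the receiver drop duplicates independently.
    idempotencyKey: `announce-give-up:${record.runId}`,
    text: "Subagent completion could not be announced after repeated retries.",
  });
  return true;
}
```

Setting the marker first trades a possibly lost notification for never sending two; the idempotency key covers the opposite ordering if a caller prefers to send first.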
## AI-assisted disclosure
- [x] AI-assisted contribution
- Testing level: fully tested on targeted + sentinel suites, build pass, controlled live smoke pass
- Confirmed understanding of changed code paths and behavior