← Back to PRs

#17445: fix(pi-embedded): add aggregate timeout to compaction retry + harden SIGUSR1 restart

by joeykrug open 2026-02-15 19:24 View on GitHub →
cli agents stale size: M
Fixes #17444 ## Summary This PR addresses two related "session looks hung" failure modes in the PI embedded runner: 1. **Compaction retry waits can block the session lane with no aggregate timeout**. 2. **SIGUSR1 in-process restart can proceed while embedded compaction is still in-flight**, leaving session write locks held by the previous lifecycle. ## What changed ### 1) Bound compaction retry waits (aggregate timeout) - Add `waitForCompactionRetryWithAggregateTimeout()` in `run/compaction-retry-aggregate-timeout.ts`. - In `run/attempt.ts`, replace the unbounded `await abortable(waitForCompactionRetry())` with a **60s aggregate timeout**. - On aggregate timeout: - set `timedOutDuringCompaction = true` - log a warning and proceed using the existing pre-compaction snapshot selection logic - Defensive normalization: non-finite `aggregateTimeoutMs` values fall back to 1ms rather than producing `NaN` timeouts. - Timer cleanup uses `timer !== undefined` (not truthiness) to handle fake-timer environments where handle can be `0`. This prevents long compaction retry waits from freezing the session lane. ### 2) Harden SIGUSR1 restart against in-flight embedded runs/compaction - Increase default SIGUSR1 deferral budget: `DEFAULT_DEFERRAL_MAX_WAIT_MS` **30s → 90s**. - Increase gateway run-loop drain budget: `DRAIN_TIMEOUT_MS` **30s → 90s**. - On SIGUSR1 restart in the gateway run loop: 1. **Abort compacting embedded runs** (best-effort, surgical — only runs where `handle.isCompacting()` is true) 2. Drain **both** active command-queue tasks and active embedded runs (up to 90s) 3. If drain times out, **abort all embedded runs** (best-effort) and proceed ### 3) Exception-safe abort handling - All three `abortEmbeddedPiRun()` code paths (single session, compacting-only, abort-all) now wrap `handle.abort()` in try/catch. - A thrown abort handler no longer prevents `server.close()` from executing during SIGUSR1 restart drain. - Failed aborts are logged at `warn` level with the session ID and error. ## Why this is safe - The aggregate timeout only affects cases where compaction retry waits would otherwise stall; normal compaction runs are unchanged. - Restart already implies disruption; aborting compacting runs is a targeted best-effort to ensure session locks are released. - Exception guards on abort are strictly additive — they only prevent cascading failures. - Changes are localized to restart/compaction wait paths. ## Tests New: - `src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.e2e.test.ts` — aggregate timeout helper - `src/agents/pi-embedded-runner/runs.e2e.test.ts` — surgical abort + `waitForActiveEmbeddedRuns` Updated: - `src/cli/gateway-cli/run-loop.test.ts` — drain timeout constant - `src/infra/infra-runtime.test.ts` — deferral timeout constant Note: e2e tests require `-c vitest.e2e.config.ts` (excluded from default config). ```bash pnpm -s lint pnpm -s format:check npx vitest run src/agents/pi-embedded-runner/ src/cli/gateway-cli/run-loop.test.ts src/infra/infra-runtime.test.ts --reporter=verbose npx vitest run -c vitest.e2e.config.ts src/agents/pi-embedded-runner/runs.e2e.test.ts src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.e2e.test.ts --reporter=verbose ``` ## Notes / follow-ups - This PR does **not** change per-attempt compaction safety timeouts (`EMBEDDED_COMPACTION_TIMEOUT_MS`); if desired, we can follow up with a smaller per-attempt ceiling and/or a compaction re-entry cooldown once we have more production data. - Refactor commits (`dedupe compaction failure`, `dedupe assistant delta parsing`, `reuse isPidAlive`) are included to reduce code duplication touched during the investigation. <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR fixes two related "session hung" failure modes in the PI embedded runner: - **Aggregate timeout for compaction retry waits**: Adds a 60s aggregate timeout to `waitForCompactionRetry()` in the embedded run attempt loop, preventing unbounded blocking of the session lane. On timeout, the run proceeds with the pre-compaction snapshot. The new `waitForCompactionRetryWithAggregateTimeout` helper uses `Promise.race` with proper timer cleanup and defensive normalization for non-finite timeout values. - **Hardened SIGUSR1 restart against in-flight compaction**: The gateway run loop now surgically aborts compacting embedded runs before draining, waits for both active tasks and embedded runs in parallel (up to 90s), and falls back to aborting all runs if the drain times out. The deferral budget and drain timeout are increased from 30s to 90s to match observed production compaction durations. - **Exception-safe abort handling**: All `abortEmbeddedPiRun` code paths now wrap `handle.abort()` in try/catch, preventing a thrown abort handler from cascading into the shutdown sequence. Changes are well-localized to restart/compaction paths, and existing behavior for normal compaction runs is unchanged. Test coverage is thorough with new e2e tests for the aggregate timeout helper and surgical abort modes. <h3>Confidence Score: 5/5</h3> - This PR is safe to merge — changes are defensive, well-scoped to restart/compaction paths, and thoroughly tested. - The changes are strictly additive safety improvements: an aggregate timeout that only fires when compaction would otherwise stall indefinitely, exception guards on abort that only prevent cascading failures, and targeted abort of compacting runs during restart. Normal compaction flows are unaffected. All new behavior paths have dedicated e2e tests, existing tests are updated to match the new timeout constants, and the code follows established patterns in the codebase (polling drain, try/catch abort, Promise.race with cleanup). No logical or security issues found after thorough analysis of all 9 changed files and their related imports. - No files require special attention <sub>Last reviewed commit: 6d7ef5a</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->

Most Similar PRs