#9085: fix: improve stability for terminated responses and telegram retries

by vladdick88 open 2026-02-04 21:14

agents stale

## Summary

Fixes for stability issues encountered during extended agent sessions.

### Changes

1. **Strip partial toolCalls from error/terminated assistants** (`session-transcript-repair.ts`, `images.ts`)
   - When API returns `stopReason=error` with partial toolCall blocks, strip them to prevent orphaned tool_use/tool_result pairing
   - Anthropic API rejects mismatched pairs, causing session crashes
2. **Auto-retry on 'terminated' errors** (`run.ts`)
   - API sometimes returns `terminated` with minimal output (transient issue)
   - Auto-retry up to 2 times with exponential backoff (2s, 4s)
3. **Improved Telegram retry defaults** (`retry-policy.ts`)
   - Increased attempts: 3 → 5
   - Increased max delay: 30s → 60s
   - Added patterns: `fetch failed`, `network`
   - Helps recover from network interruptions (sleep/wake, WiFi drops)
4. **Re-sanitize after context rebuild** (`attempt.ts`)
   - Ensures tool pairing stays valid after orphan removal

### Testing

These fixes were tested over multiple days of real usage with long-running sessions, resolving:

- `terminated` errors that previously required manual session restart
- Tool pairing mismatches after context compaction
- Telegram send failures after Mac sleep/wake cycles

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR hardens long-running embedded agent sessions by (1) stripping partial tool-call blocks from assistant turns with `stopReason: "error"` to prevent tool_use/tool_result pairing rejections, (2) retrying transient Anthropic `terminated` responses with backoff, (3) increasing Telegram retry defaults and broadening retryable error patterns, and (4) re-sanitizing session context after rebuilding it to maintain tool pairing.

The changes touch transcript sanitization (`session-transcript-repair.ts`, `images.ts`), the embedded runner retry loop (`run.ts`), and session-context rebuild handling (`run/attempt.ts`), plus shared retry policy defaults (`retry-policy.ts`).
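The first change, stripping partial tool-call blocks from assistant turns that ended with `stopReason: "error"`, can be sketched roughly as follows. This is a minimal illustration, not the code in `session-transcript-repair.ts`: the message and block types, the `stripPartialToolCalls` name, and the placeholder text are all assumptions made for the example.

```typescript
// Hypothetical, simplified transcript shapes; the real types in the PR may differ.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "toolCall"; id: string; name: string };

interface AssistantMessage {
  role: "assistant";
  stopReason: "stop" | "toolUse" | "error";
  content: ContentBlock[];
}

interface UserMessage {
  role: "user";
  content: ContentBlock[];
}

type Message = AssistantMessage | UserMessage;

// Assumed placeholder so an error turn is never dropped from the transcript.
const ERROR_PLACEHOLDER: ContentBlock = {
  type: "text",
  text: "[response ended with an error]",
};

// Remove toolCall blocks from assistant turns that ended in an error, so no
// tool call is left in the transcript without a matching tool result.
function stripPartialToolCalls(messages: Message[]): Message[] {
  return messages.map((msg): Message => {
    if (msg.role !== "assistant" || msg.stopReason !== "error") {
      return msg; // completed turns and user turns pass through untouched
    }
    const kept: ContentBlock[] = msg.content.filter((b) => b.type !== "toolCall");
    // Keep the turn itself even if every block was a partial tool call.
    const content: ContentBlock[] = kept.length > 0 ? kept : [ERROR_PLACEHOLDER];
    return { ...msg, content };
  });
}
```

The sketch deliberately keeps the error turn in place, substituting a placeholder when nothing else remains, so the transcript still records that a failure happened. Whether the PR does the same is not stated in the description.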
<h3>Confidence Score: 3/5</h3>

- This PR is close to mergeable but has a couple of transcript-handling edge cases that can drop error turns or skew retry behavior.
- Core intent (avoid mismatched tool_use/tool_result pairs; retry transient terminated errors; broaden Telegram retry) is sound, but the new logic can delete assistant error turns entirely and the terminated retry counter is not scoped to a single failure burst, which can lead to confusing behavior in extended sessions.
- src/agents/pi-embedded-helpers/images.ts, src/agents/session-transcript-repair.ts, src/agents/pi-embedded-runner/run.ts
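To illustrate the retry behavior discussed in this PR, here is a minimal sketch of pattern-based retry with exponential backoff in which the retry counter is local to a single call, so each failure burst starts from zero. The names `RetryOptions`, `backoffDelayMs`, `isRetryable`, and `withRetry` are hypothetical and do not come from `run.ts` or `retry-policy.ts`; only the numbers (2 retries at 2s and 4s, a 60s cap, and the `terminated`, `fetch failed`, and `network` patterns) are taken from the PR description.

```typescript
// Hypothetical retry helper; names and option shapes are illustrative only.
interface RetryOptions {
  maxRetries: number; // retries allowed after the first attempt
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound on any single delay
  retryablePatterns: RegExp[]; // error messages that count as transient
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Exponential backoff: base, 2*base, 4*base, ... capped at maxDelayMs.
function backoffDelayMs(retryIndex: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(baseDelayMs * Math.pow(2, retryIndex), maxDelayMs);
}

// An error is retryable when its message matches any configured pattern.
function isRetryable(err: unknown, patterns: RegExp[]): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return patterns.some((p) => p.test(message));
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let retries = 0; // scoped to this call, so it cannot leak across failure bursts
  for (;;) {
    try {
      return await fn();
    } catch (err) {
      if (retries >= opts.maxRetries || !isRetryable(err, opts.retryablePatterns)) {
        throw err; // out of retries, or not a transient error
      }
      await sleep(backoffDelayMs(retries, opts.baseDelayMs, opts.maxDelayMs));
      retries += 1;
    }
  }
}

// Example option sets mirroring the numbers in the PR description (assumed shapes).
const terminatedRetry: RetryOptions = {
  maxRetries: 2,
  baseDelayMs: 2000,
  maxDelayMs: 60000,
  retryablePatterns: [/terminated/i],
};
const networkPatterns: RegExp[] = [/fetch failed/i, /network/i];
```

With `terminatedRetry`, the waits before the two retries are 2s and 4s. Because `retries` lives inside `withRetry`, a later unrelated failure gets its full retry budget again, which is one way to avoid the counter-scoping concern raised in the review.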