#9085: fix: improve stability for terminated responses and telegram retries
agents
stale
Cluster: Error Handling in Agent Tools
## Summary
Fixes for stability issues encountered during extended agent sessions.
### Changes
1. **Strip partial toolCalls from error/terminated assistant messages** (`session-transcript-repair.ts`, `images.ts`)
- When the API returns `stopReason=error` with partial toolCall blocks, strip those blocks so that no tool_use is left without a matching tool_result
- The Anthropic API rejects mismatched pairs, which crashed the session
2. **Auto-retry on 'terminated' errors** (`run.ts`)
- The API sometimes returns `terminated` with minimal output (a transient issue)
- Auto-retry up to 2 times with exponential backoff (2s, 4s)
3. **Improved Telegram retry defaults** (`retry-policy.ts`)
- Increased attempts: 3 → 5
- Increased max delay: 30s → 60s
- Added patterns: `fetch failed`, `network`
- Helps recover from network interruptions (sleep/wake, WiFi drops)
4. **Re-sanitize after context rebuild** (`attempt.ts`)
- Ensures tool pairing stays valid after orphan removal
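
A minimal sketch of the stripping step from item 1, for illustration only. The message and block shapes (`AssistantMessage`, `ContentBlock`) and the function name `stripPartialToolCalls` are assumptions made up for this example; the actual types in `session-transcript-repair.ts` and `images.ts` may differ.

```typescript
// Hypothetical, simplified transcript shapes; not the project's real types.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "toolCall"; id: string; name: string; arguments: unknown };

interface AssistantMessage {
  role: "assistant";
  stopReason: "stop" | "toolUse" | "error";
  content: ContentBlock[];
}

// Returns a copy without toolCall blocks when the turn ended in an error.
// Messages with any other stopReason are returned unchanged.
function stripPartialToolCalls(msg: AssistantMessage): AssistantMessage {
  if (msg.stopReason !== "error") return msg;
  return {
    ...msg,
    content: msg.content.filter((block) => block.type !== "toolCall"),
  };
}
```

This sketch keeps the errored turn and any text it produced, and removes only the tool-call blocks. Whether a turn left with no blocks should be kept or dropped is a separate design decision that the sketch does not settle.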
### Testing
These fixes were tested over multiple days of real usage with long-running sessions, resolving:
- `terminated` errors that previously required manual session restart
- Tool pairing mismatches after context compaction
- Telegram send failures after Mac sleep/wake cycles
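
The Telegram recovery case depends on classifying network failures as retryable. A rough sketch of that check follows, assuming a hypothetical `RetryPolicy` shape and helper name `isRetryable`. Only the two patterns added in this PR are shown; the existing patterns in `retry-policy.ts` are omitted.

```typescript
// Hypothetical policy shape; the numbers mirror the new defaults in this PR.
interface RetryPolicy {
  attempts: number;
  maxDelayMs: number;
  retryablePatterns: RegExp[];
}

const telegramRetryPolicy: RetryPolicy = {
  attempts: 5, // previously 3
  maxDelayMs: 60000, // previously 30000
  retryablePatterns: [/fetch failed/i, /network/i],
};

// True when the error message matches any retryable pattern.
function isRetryable(err: unknown, policy: RetryPolicy): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return policy.retryablePatterns.some((pattern) => pattern.test(message));
}
```

A broad pattern such as `network` also matches unrelated messages that happen to contain the word, so the cap on attempts is what bounds the cost of a false positive.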
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR hardens long-running embedded agent sessions by (1) stripping partial tool-call blocks from assistant turns with `stopReason: "error"` to prevent tool_use/tool_result pairing rejections, (2) retrying transient Anthropic `terminated` responses with backoff, (3) increasing Telegram retry defaults and broadening retryable error patterns, and (4) re-sanitizing session context after rebuilding it to maintain tool pairing.
The changes touch transcript sanitization (`session-transcript-repair.ts`, `images.ts`), the embedded runner retry loop (`run.ts`), and session-context rebuild handling (`run/attempt.ts`), plus shared retry policy defaults (`retry-policy.ts`).
<h3>Confidence Score: 3/5</h3>
- This PR is close to mergeable but has a couple of transcript-handling edge cases that can drop error turns or skew retry behavior.
- The core intent (avoid mismatched tool_use/tool_result pairs; retry transient terminated errors; broaden Telegram retry) is sound. However, the new logic can delete assistant error turns entirely, and the terminated retry counter is not scoped to a single failure burst, which can lead to confusing behavior in extended sessions.
- Files that need attention: src/agents/pi-embedded-helpers/images.ts, src/agents/session-transcript-repair.ts, src/agents/pi-embedded-runner/run.ts
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
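
To illustrate the retry-scoping concern raised in the review, here is one possible shape for the `terminated` retry, with a counter that is local to each call. All names (`runWithTerminatedRetry`, `isTerminated`, `terminatedBackoffMs`) are hypothetical. The sketch also assumes a terminated response surfaces as a thrown error whose message contains `terminated`, which may not match how `run.ts` detects it.

```typescript
// Hypothetical names throughout; the real retry loop may be shaped differently.
const MAX_TERMINATED_RETRIES = 2;
const BASE_DELAY_MS = 2000;

// Assumption: a terminated response shows up as an error mentioning "terminated".
function isTerminated(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return /terminated/i.test(message);
}

// Delay before retry number `retryIndex` (0-based): 2000 ms, then 4000 ms.
function terminatedBackoffMs(retryIndex: number): number {
  return BASE_DELAY_MS * 2 ** retryIndex;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function runWithTerminatedRetry<T>(
  run: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  // The counter lives inside this call, so each failure burst starts at zero.
  for (let retries = 0; ; retries++) {
    try {
      return await run();
    } catch (err) {
      // Rethrow anything that is not "terminated", or once retries run out.
      if (!isTerminated(err) || retries >= MAX_TERMINATED_RETRIES) throw err;
      await sleep(terminatedBackoffMs(retries));
    }
  }
}
```

Because `retries` is declared inside the function, a successful run followed by a later `terminated` failure gets the full retry budget again. The `sleep` parameter is injectable so that tests can skip the real delays.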
Most Similar PRs
#15050: fix: transcript corruption resilience — strip aborted tool_use bloc...
by yashchitneni · 2026-02-12
86.3%
#7525: Agents: skip errored tool calls during pairing
by justinhuangcode · 2026-02-02
85.9%
#4844: fix(agents): skip error/aborted assistant messages in transcript re...
by lailoo · 2026-01-30
85.8%
#14328: fix: strip incomplete tool_use blocks from errored/aborted messages...
by Kropiunig · 2026-02-12
84.5%
#3362: fix: auto-repair and retry on orphan tool_result errors
by samhotchkiss · 2026-01-28
84.3%
#12487: fix(agents): strip orphaned tool_result when tool_use is sanitized ...
by skylarkoo7 · 2026-02-09
83.8%
#8270: fix: support snake_case 'tool_use' in transcript repair (#8264)
by heliosarchitect · 2026-02-03
83.6%
#3647: fix: sanitize tool arguments in session history
by nhangen · 2026-01-29
82.5%
#9416: fix: drop errored/aborted assistant tool pairs in transcript repair
by xandorklein · 2026-02-05
82.3%
#21195: fix: suppress orphaned tool_use/tool_result errors after session co...
by ruslansychov-git · 2026-02-19
81.9%