#15050: fix: transcript corruption resilience — strip aborted tool_use blocks, isolate format error cooldowns
agents
stale
size: M
Cluster: Error Handling in Agent Tools
## Summary
Three targeted fixes for [#15037](https://github.com/openclaw/openclaw/issues/15037): corrupted session transcripts that break boot and cascade auth-profile cooldowns to all channels.
## Problem
When the gateway crashes or receives SIGUSR1 mid-tool-call, the session transcript can end up with assistant messages containing tool_use blocks that have no matching tool_results. On the next boot, the Anthropic API rejects the malformed history with a 400 error. That failure puts the auth profile into cooldown, and the cooldown blocks **all** channels (Slack, Telegram, webchat) that share the profile.
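To make the failure mode concrete, here is a minimal sketch of how an unmatched tool call can be detected in a transcript. The type names, field names (`toolCallId`, `stopReason`), and the helper `findUnmatchedToolCalls` are hypothetical simplifications for illustration, not the shapes used in the codebase.

```typescript
// Hypothetical, simplified transcript shapes (assumptions, not the real types).
type TranscriptBlock =
  | { type: "text"; text: string }
  | { type: "toolCall"; id: string; name: string };

type TranscriptMessage =
  | { role: "user"; content: string }
  | { role: "assistant"; content: TranscriptBlock[]; stopReason?: string }
  | { role: "toolResult"; toolCallId: string; content: string };

// Returns the ids of tool calls that have no tool result anywhere in the transcript.
function findUnmatchedToolCalls(messages: TranscriptMessage[]): string[] {
  const resultIds = new Set<string>();
  for (const m of messages) {
    if (m.role === "toolResult") resultIds.add(m.toolCallId);
  }
  const unmatched: string[] = [];
  for (const m of messages) {
    if (m.role !== "assistant") continue;
    for (const b of m.content) {
      if (b.type === "toolCall" && !resultIds.has(b.id)) unmatched.push(b.id);
    }
  }
  return unmatched;
}

// A transcript cut off mid-tool-call: the assistant asked for a tool,
// but the process died before any result was recorded.
const crashedTranscript: TranscriptMessage[] = [
  { role: "user", content: "list files" },
  {
    role: "assistant",
    stopReason: "aborted",
    content: [{ type: "toolCall", id: "call_1", name: "exec" }],
  },
];

console.log(findUnmatchedToolCalls(crashedTranscript));
```

A strict provider rejects a history like `crashedTranscript` because `call_1` never receives a result.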
## Fixes
### Fix 1: Strip tool_use blocks from aborted/errored assistant messages
**File:** `src/agents/session-transcript-repair.ts`
The existing code skipped repair for aborted/errored messages but **still kept them with their tool_use blocks**. The API then expects matching tool_results that don't exist and returns a 400 error.
**Fix:** Strip tool_use blocks from aborted/errored assistant messages entirely. If the message has text content alongside tool calls, keep the text. If all content was tool calls, drop the message.
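The rule above can be sketched roughly as follows. This is an illustration under assumed types, not the code in `session-transcript-repair.ts`: `ContentBlock`, `AssistantTurn`, and `stripFailedToolCalls` are hypothetical names, and the real repair pass handles more cases.

```typescript
// Hypothetical, simplified shapes for illustration only.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "toolCall" | "toolUse"; id: string; name: string };

interface AssistantTurn {
  role: "assistant";
  content: ContentBlock[];
  stopReason: "stop" | "toolUse" | "error" | "aborted";
}

function isToolBlock(block: ContentBlock): boolean {
  return block.type === "toolCall" || block.type === "toolUse";
}

// Returns the repaired turn, or null when the turn should be dropped.
function stripFailedToolCalls(turn: AssistantTurn): AssistantTurn | null {
  const failed = turn.stopReason === "error" || turn.stopReason === "aborted";
  if (!failed) return turn; // normal turns are left for the usual pairing repair

  const kept = turn.content.filter((b) => !isToolBlock(b));
  if (kept.length === 0) return null; // all content was tool calls: drop the message
  return { ...turn, content: kept }; // keep the text, lose the tool calls
}

const mixedTurn: AssistantTurn = {
  role: "assistant",
  stopReason: "error",
  content: [
    { type: "text", text: "Let me check that." },
    { type: "toolCall", id: "call_1", name: "exec" },
  ],
};

const toolOnlyTurn: AssistantTurn = {
  role: "assistant",
  stopReason: "aborted",
  content: [{ type: "toolUse", id: "call_2", name: "read" }],
};
```

With these inputs, `stripFailedToolCalls(mixedTurn)` keeps only the text block and `stripFailedToolCalls(toolOnlyTurn)` returns `null`.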
### Fix 2: Don't cooldown auth profiles for format errors
**File:** `src/agents/pi-embedded-runner/run.ts`
Format errors (400) indicate malformed session input, not provider unavailability. Previously they triggered `markAuthProfileFailure()`, which put the profile into exponential-backoff cooldown and affected every session sharing that profile.
**Fix:** Skip `markAuthProfileFailure()` when `promptFailoverReason === "format"`. The failing session gets its error, other sessions continue working.
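A minimal sketch of the guard, assuming a simplified control flow. Only the `"format"` reason and the name `markAuthProfileFailure` come from this PR; the other reason values, the function signature, and the `cooledDown` array are stand-ins invented for the example.

```typescript
// Hypothetical reason values; only "format" is named in this PR.
type FailoverReason = "format" | "rate_limit" | "auth" | "timeout";

// Stand-in for the real markAuthProfileFailure(); the signature is assumed.
const cooledDown: string[] = [];
function markAuthProfileFailure(profileId: string): void {
  cooledDown.push(profileId);
}

// Format errors mean the session's own input is malformed,
// so the shared profile is left untouched.
function shouldMarkProfileFailure(reason: FailoverReason): boolean {
  return reason !== "format";
}

function handlePromptFailure(profileId: string, promptFailoverReason: FailoverReason): void {
  if (shouldMarkProfileFailure(promptFailoverReason)) {
    markAuthProfileFailure(profileId);
  }
}

handlePromptFailure("profile-a", "format"); // session-scoped error, no cooldown
handlePromptFailure("profile-b", "rate_limit"); // provider-side problem, cooldown applies
```

After both calls, only `profile-b` has been recorded in `cooledDown`; the session that hit the format error still surfaces its own error.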
### Fix 3: Updated tests
**File:** `src/agents/session-transcript-repair.test.ts`
Updated existing tests to reflect new behavior (tool_use blocks stripped from aborted messages instead of passed through). Added new test case for mixed content (text + tool_use) in errored messages.
## Testing
- Updated 3 existing test cases
- Added 1 new test case for mixed content in errored messages
- Transcript repair changes are backward-compatible: only aborted/errored assistant messages are affected
## Impact
- **Before:** One corrupted session takes down all channels (Slack, Telegram, webchat)
- **After:** Corrupted sessions self-heal on load; format errors don't cascade to other sessions
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR addresses session transcript corruption and auth-profile cooldown cascades by:
- Updating transcript repair (`src/agents/session-transcript-repair.ts`) to strip `toolCall`/`toolUse` blocks from assistant messages that ended with `stopReason: "error"` or `"aborted"` (and drop the message if nothing remains), preventing strict providers from rejecting history due to unmatched tool calls.
- Adjusting the embedded runner (`src/agents/pi-embedded-runner/run.ts`) to avoid placing shared auth profiles into cooldown for format errors (`promptFailoverReason === "format"`), keeping format/corruption failures session-scoped.
- Updating and expanding unit tests (`src/agents/session-transcript-repair.test.ts`) to validate the new stripping/drop behavior, including mixed text + tool-call content.
Overall, the changes fit the existing repair pipeline by keeping “normal” tool-call pairing repairs intact while making aborted/error turns safe to submit to strict APIs and reducing cross-session blast radius from malformed-history 400s.
<h3>Confidence Score: 4/5</h3>
- This PR is close to safe to merge, with one correctness issue in transcript repair reporting/rewriting behavior.
- Core logic changes are targeted and align with existing failover/cooldown handling, but the new aborted/error stripping path unconditionally marks the transcript as changed for any array-content assistant message, which can cause unnecessary rewrites and incorrect `moved` reporting even when no tool blocks existed.
- src/agents/session-transcript-repair.ts
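
One way to address the issue flagged above is to report a change only when a tool block was actually removed. The sketch below is an assumption about how that could look, using hypothetical names (`ReviewBlock`, `StripResult`, `stripToolCallsTracked`); it is not the repair code under review.

```typescript
// Hypothetical shapes for illustration only.
type ReviewBlock = { type: "text"; text: string } | { type: "toolCall"; id: string };

interface StripResult {
  content: ReviewBlock[];
  changed: boolean;
}

function stripToolCallsTracked(content: ReviewBlock[]): StripResult {
  const kept = content.filter((b) => b.type !== "toolCall");
  // Report a change only when a block was actually removed,
  // so text-only failed turns do not trigger a transcript rewrite.
  const changed = kept.length !== content.length;
  return { content: changed ? kept : content, changed };
}

const textOnly: ReviewBlock[] = [{ type: "text", text: "partial answer" }];
const withTool: ReviewBlock[] = [
  { type: "text", text: "checking" },
  { type: "toolCall", id: "call_1" },
];
```

Here `stripToolCallsTracked(textOnly)` returns the original array with `changed: false`, while `stripToolCallsTracked(withTool)` returns a one-element array with `changed: true`.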
<sub>Last reviewed commit: fb8862b</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #14328: fix: strip incomplete tool_use blocks from errored/aborted messages... (by Kropiunig · 2026-02-12 · 88.4% similar)
- #4844: fix(agents): skip error/aborted assistant messages in transcript re... (by lailoo · 2026-01-30 · 87.8% similar)
- #12487: fix(agents): strip orphaned tool_result when tool_use is sanitized ... (by skylarkoo7 · 2026-02-09 · 87.6% similar)
- #8270: fix: support snake_case 'tool_use' in transcript repair (#8264) (by heliosarchitect · 2026-02-03 · 87.3% similar)
- #14368: fix: skip auth profile cooldown on format errors to prevent provide... (by koatora20 · 2026-02-12 · 87.2% similar)
- #8345: fix: prevent synthetic error repair from creating tool_result for d... (by vishaltandale00 · 2026-02-03 · 86.9% similar)
- #9416: fix: drop errored/aborted assistant tool pairs in transcript repair (by xandorklein · 2026-02-05 · 86.8% similar)
- #9085: fix: improve stability for terminated responses and telegram retries (by vladdick88 · 2026-02-04 · 86.3% similar)
- #6687: fix(session-repair): strip malformed tool_use blocks to prevent per... (by NSEvent · 2026-02-01 · 85.9% similar)
- #16966: fix: strip tool_use blocks from aborted/errored assistant messages (by StressTestor · 2026-02-15 · 85.0% similar)