#15050: fix: transcript corruption resilience — strip aborted tool_use blocks, isolate format error cooldowns

by yashchitneni open 2026-02-12 22:38

agents stale size: M

## Summary

Three targeted fixes for [#15037](https://github.com/openclaw/openclaw/issues/15037): corrupted session transcripts crashing boot and cascading auth cooldowns to all channels.

## Problem

When the gateway crashes or receives SIGUSR1 mid-tool-call, the session transcript can end up with assistant messages containing tool_use blocks that have no matching tool_results. On next boot, the Anthropic API rejects the malformed history, which puts the auth profile into cooldown, which cascades to block **all** channels (Slack, Telegram, webchat).

## Fixes

### Fix 1: Strip tool_use blocks from aborted/errored assistant messages

**File:** `src/agents/session-transcript-repair.ts`

The existing code skipped repair for aborted/errored messages but **still kept them with their tool_use blocks**. The API then expects matching tool_results that don't exist → 400 error.

**Fix:** Strip tool_use blocks from aborted/errored assistant messages entirely. If the message has text content alongside tool calls, keep the text. If all content was tool calls, drop the message.

### Fix 2: Don't cooldown auth profiles for format errors

**File:** `src/agents/pi-embedded-runner/run.ts`

Format errors (400) indicate malformed session input, not provider unavailability. Previously they triggered `markAuthProfileFailure()`, which put the profile into exponential backoff cooldown, affecting every session sharing that profile.

**Fix:** Skip `markAuthProfileFailure()` when `promptFailoverReason === "format"`. The failing session gets its error, other sessions continue working.

### Fix 3: Updated tests

**File:** `src/agents/session-transcript-repair.test.ts`

Updated existing tests to reflect new behavior (tool_use blocks stripped from aborted messages instead of passed through). Added new test case for mixed content (text + tool_use) in errored messages.

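For illustration, the behavior described in Fix 1 and Fix 2 can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: the type shapes, the helper names `stripAbortedToolCalls` and `shouldMarkAuthProfileFailure`, and every failover reason other than `"format"` are assumptions made for the example.

```typescript
// Hypothetical, simplified shapes; the real types in the repository may differ.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "toolCall" | "toolUse"; id: string; name: string };

interface AssistantMessage {
  role: "assistant";
  stopReason: "stop" | "toolUse" | "error" | "aborted";
  content: ContentBlock[];
}

function isToolBlock(block: ContentBlock): boolean {
  return block.type === "toolCall" || block.type === "toolUse";
}

// Fix 1 (sketch): strip tool blocks from aborted/errored assistant turns.
// Returns null when nothing is left, meaning the message should be dropped.
function stripAbortedToolCalls(msg: AssistantMessage): AssistantMessage | null {
  if (msg.stopReason !== "error" && msg.stopReason !== "aborted") {
    return msg; // completed turns are left to the normal pairing repair
  }
  const kept = msg.content.filter((block) => !isToolBlock(block));
  if (kept.length === 0) {
    return null; // the message contained only tool calls
  }
  return { ...msg, content: kept };
}

// Fix 2 (sketch): only "format" comes from the PR text; the other
// reason names are placeholders for this example.
type FailoverReason = "format" | "rate_limit" | "auth" | "timeout";

// Format errors point at a malformed session, not an unhealthy provider,
// so they should not put the shared auth profile into cooldown.
function shouldMarkAuthProfileFailure(reason: FailoverReason): boolean {
  return reason !== "format";
}
```

Under these assumptions, the caller in the runner would invoke `markAuthProfileFailure()` only when `shouldMarkAuthProfileFailure(promptFailoverReason)` returns `true`, so a corrupted transcript fails its own session without touching the shared profile.
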
## Testing

- Updated 3 existing test cases
- Added 1 new test case for mixed content in errored messages
- All changes are backward-compatible: only aborted/errored messages are affected

## Impact

- **Before:** One corrupted session takes down all channels (Slack, Telegram, webchat)
- **After:** Corrupted sessions self-heal on load; format errors don't cascade to other sessions

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR addresses session transcript corruption and auth-profile cooldown cascades by:

- Updating transcript repair (`src/agents/session-transcript-repair.ts`) to strip `toolCall`/`toolUse` blocks from assistant messages that ended with `stopReason: "error"` or `"aborted"` (and drop the message if nothing remains), preventing strict providers from rejecting history due to unmatched tool calls.
- Adjusting the embedded runner (`src/agents/pi-embedded-runner/run.ts`) to avoid placing shared auth profiles into cooldown for format errors (`promptFailoverReason === "format"`), keeping format/corruption failures session-scoped.
- Updating and expanding unit tests (`src/agents/session-transcript-repair.test.ts`) to validate the new stripping/drop behavior, including mixed text + tool-call content.

Overall, the changes fit the existing repair pipeline by keeping “normal” tool-call pairing repairs intact while making aborted/error turns safe to submit to strict APIs and reducing cross-session blast radius from malformed-history 400s.

<h3>Confidence Score: 4/5</h3>

- This PR is close to safe to merge, with one correctness issue in transcript repair reporting/rewriting behavior.
- Core logic changes are targeted and align with existing failover/cooldown handling, but the new aborted/error stripping path unconditionally marks the transcript as changed for any array-content assistant message, which can cause unnecessary rewrites and incorrect `moved` reporting even when no tool blocks existed.
- src/agents/session-transcript-repair.ts

<sub>Last reviewed commit: fb8862b</sub>
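
One possible way to address the reporting issue raised in the review is to derive the "changed" flag from whether any block was actually removed, rather than setting it for every array-content message. The sketch below is only an illustration under that assumption; `Block`, `StripOutcome`, and `stripToolBlocksTracked` are hypothetical names, not identifiers from the repository.

```typescript
// Hypothetical minimal block shape; only the `type` discriminator matters here.
interface Block {
  type: string;
}

interface StripOutcome {
  content: Block[];
  changed: boolean; // true only if at least one tool block was removed
}

// Removes toolCall/toolUse blocks and reports whether anything was removed.
function stripToolBlocksTracked(content: Block[]): StripOutcome {
  const kept = content.filter(
    (block) => block.type !== "toolCall" && block.type !== "toolUse",
  );
  const changed = kept.length !== content.length;
  // Hand back the original array when nothing was stripped, so callers can
  // skip rewriting the transcript for untouched messages.
  return { content: changed ? kept : content, changed };
}
```

With this shape, a text-only errored message yields `changed: false`, so the repair pass would have no reason to rewrite the transcript for it.
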