← Back to PRs

#18432: fix(agents): clear active run state immediately on embedded timeout

by BinHPdev open 2026-02-16 18:13 View on GitHub →
channel: mattermost agents size: S
## Summary - Problem: Embedded agent run timeout leaves session in active state indefinitely, blocking announce queue - Why it matters: All subsequent cron/heartbeat deliveries fail permanently, even after gateway restart (sqlite persistence) - What changed: Timeout handler now immediately clears active run state instead of waiting for finally block - What did NOT change: Normal (non-timeout) cleanup path unchanged; double-clear is safe due to handle check ## Change Type (select all) - [x] Bug fix - [ ] Feature - [ ] Refactor - [ ] Docs - [ ] Security hardening - [ ] Chore/infra ## Scope (select all touched areas) - [x] Gateway / orchestration - [ ] Skills / tool execution - [ ] Auth / tokens - [ ] Memory / storage - [ ] Integrations - [ ] API / contracts - [ ] UI / DX - [ ] CI/CD / infra ## Linked Issue/PR - Closes #18302 ## User-visible / Behavior Changes After an embedded agent timeout, the session will immediately transition to idle state and the announce queue will drain normally. Users will no longer experience permanent deadlock requiring manual sqlite deletion. ## Security Impact (required) - New permissions/capabilities? No - Secrets/tokens handling changed? No - New/changed network calls? No - Command/tool execution surface changed? No - Data access scope changed? No ## Repro + Verification ### Environment - OS: macOS 15.3.1 (from issue report) - Runtime: Node v25.6.1 - Model: ollama/MiniMax-M2.5-MLX-8bit (slow local model) - Configuration: Cron job triggering embedded agent ### Steps 1. Configure cron job with embedded agent run 2. Have LLM backend respond slowly (>300s timeout) 3. Observe timeout log 4. Verify session transitions to idle 5. Subsequent cron/heartbeat deliveries succeed ### Expected Session transitions to idle after timeout, queue drains normally ### Actual Session immediately transitions to idle (fixed behavior) ## Evidence - [x] Code review shows clearActiveEmbeddedRun now called in timeout handler - [x] All embedded runner tests pass (36 tests) - [x] Build and lint checks pass ## Human Verification (required) What I verified: - Timeout handler logic updated for both internal and external abort signals - clearActiveEmbeddedRun safe to call multiple times (handle comparison) - All embedded runner tests pass - Build and type checks pass Edge cases checked: - Double-clear is safe (handle mismatch check in clearActiveEmbeddedRun) - External abort signal timeout also handled - Comments clarify the fix and reference issue What I did **not** verify: - Live cron job timeout scenario with sqlite persistence (don't have slow model setup) - Announce queue drain after timeout recovery ## Compatibility / Migration - Backward compatible? Yes - Config/env changes? No - Migration needed? No ## Failure Recovery (if this breaks) - How to disable/revert: Revert this commit - Files/config to restore: None - Known bad symptoms: Session might clear prematurely if handle logic has bugs (unlikely) ## Risks and Mitigations - Risk: clearActiveEmbeddedRun called twice might cause confusion - Mitigation: Added comment explaining safe double-call; handle check prevents actual double-clear - Risk: Immediate clear during timeout might race with finally block cleanup - Mitigation: clearActiveEmbeddedRun has handle comparison to prevent issues <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR fixes a critical bug where embedded agent timeouts left sessions in an indefinitely active state, blocking the announce queue and causing all subsequent cron/heartbeat deliveries to fail. The fix adds immediate `clearActiveEmbeddedRun` calls in both the internal timeout handler (line 803) and external abort timeout handler (line 837), ensuring the session transitions to idle state as soon as the timeout occurs rather than waiting for the finally block which may be delayed by compaction or other blocking operations. The implementation is safe due to handle comparison in `clearActiveEmbeddedRun` (runs.ts:141) that prevents double-clear issues. **Key changes:** - Added immediate active run state clearing in timeout handlers - Documented the fix with detailed comments explaining the issue and solution - Safe to call multiple times due to handle mismatch check in `clearActiveEmbeddedRun` **Note:** CHANGELOG.md has a duplicate entry on lines 17-18 (already flagged in previous threads) <h3>Confidence Score: 5/5</h3> - This PR is safe to merge with high confidence - it fixes a critical bug with a minimal, well-understood change - The fix is surgically targeted to the specific issue (timeout leaving session active), uses existing safe mechanisms (handle comparison prevents double-clear), includes comprehensive documentation explaining the problem and solution, and follows the repository's coding standards. The change is minimal (3 added clearActiveEmbeddedRun calls), doesn't alter any control flow or data structures, and leverages existing safety guarantees in the clearActiveEmbeddedRun function. The only issue is a non-critical CHANGELOG duplicate that was already flagged. - CHANGELOG.md needs minor cleanup (duplicate entry), but the core fix in attempt.ts is solid <sub>Last reviewed commit: 1eb55ae</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->

Most Similar PRs